Preface


The key result obtained by Fisher and Tippett in 1928 on the possible limit laws 
of the sample maximum has seemingly created the idea that extreme value theory 
was something rather special, very different from classical central limit theory. In 
fact, the number of publications dealing with statistical aspects of extremes dated 
before 1970 is at most a dozen. The book by E. J. Gumbel, published by Columbia 
University Press in 1958, has for a long time been considered as the main referential 
work for applications of extreme value theory in engineering subjects. A close look 
at this seminal publication shows that in the early stages one tried to approach 
extreme value theory via central limit machinery. During the decade following 
its appearance, no change occurred in the lack of interest among probabilists and 
statisticians who contributed only a very limited number of relevant papers. 

From the theoretical point of view, the 1970 doctoral dissertation by L. de 
Haan On Regular Variation and its Applications to the Weak Convergence of Sample 
Extremes seems to be the starting point for theoretical developments in extreme 
value theory. For the first time, the probabilistic and stochastic properties of sample 
extremes were developed into a coherent and attractive theory, comparable to the 
theory of sums of random variables. The statistical aspects had to wait even longer 
before they received the necessary attention. 

In Chapter 1, we illustrate why and how one should look at extreme values in 
a data set. Many of these examples will reappear as illustrations and even as case 
studies in the sequel. The next five chapters deal with the univariate theory for the 
case of independent and identically distributed random variables. Chapter 2 covers 
the probabilistic limiting problem for determining the possible limits of sample 
extremes together with the connected domain of attraction problem. The extremal 
domain of attraction condition is, however, too weak to use to fully develop useful 
statistical theories of estimation, construction of confidence intervals, bias reduction,
and so on. The need for second order information is illustrated in Chapter 3.
Armed with this information, we attack the tail estimation problem for the Pareto 
case in Chapter 4 and for the general case in Chapter 5. All the methods developed 
so far are then illustrated by a number of case studies in Chapter 6. 

The last five chapters deal with topics that are still in full and vigorous development.
We can only try to give a picture that is as complete as possible at the time
of writing. To broaden the statistical machinery in the univariate case, Chapter 7 
treats a variety of alternative methods under a common umbrella of regression-type
methods. Chapters 8 and 9 deal with multivariate extremes and repeat some of the
methodology of previous chapters, in more than one dimension. In the first of
these two chapters, we deal with the probabilistic aspects of multivariate extreme
value theory by including the possible limits and their domains of attraction; the
next chapter is then devoted to the statistical features of this important subject.
Chapter 10 gives an almost self-contained survey of extreme value methods in time
series analysis, an area where the importance of extremes has already long been
recognized. We finish with a separate and tentative chapter on Bayesian methods,
a topic in need of further and deep study.

We are aware that it is a daring act to write a book with the title Statistics of 
Extremes, the same as that of the first main treatise on extremes. What is even 
more daring is our attempt to cope with the incredible speed at which statistical 
extreme value theory has been exploding. More than half of the references in this 
book appeared over the last ten years. However, it is our sincere conviction that 
over the last two decades extreme value theory has matured and that it should 
become part of any in-depth education in statistics or its applications. We hope 
that this slight attempt of ours gets extreme value theory duly recognized. 

Here are some of the main features of the book. 

1. The probabilistic aspects in the first few chapters are streamlined to quickly 
arrive at the key conditions needed to understand the behaviour of sample 
extremes. It would have been possible to write a more complete and rigorous 
text that would automatically be much more mathematical. We felt that, for 
practical purposes, we could safely restrict ourselves to the case where the 
underlying random variables are sufficiently continuous. While more general 
conditions would be possible, there is little to gain with a more formal 
approach. 

2. Under this extra condition, the mathematical intricacies of the subject are 
usually quite tractable. Wherever possible, we provide insight into why and 
how the mathematical operations lead to otherwise peculiar conditions. To 
keep a smooth flow in the development, technical details within a chapter are 
deferred to the last section of that chapter. However, statements of theorems 
are always given in their fullest generality. 

3. Because of the lively speed at which extreme value theory has been developing,
thoroughly different approaches are possible when solving a statistical
problem. To avoid single-handedness, we therefore included alternative procedures
that boast sufficient theoretical and practical underpinning.

4. Being strong believers in graphical procedures, we illustrate concepts, derivations
and results by graphical tools. It is hard to overestimate the role of the
latter in getting a quick but reliable impression of the kind and quality of 
data. 

5. Examples and case studies are amply scattered over the manuscript, some 
of them reappearing to illustrate how a more advanced technique results 
in better insight into the data. The wide variety in areas of application as
covered in the first chapter beautifully illustrates how extreme value theory 
has anchored itself in various branches of applied statistics. 

6. An extensive bibliography is included. This material should help the reader 
find his or her way through the bursting literature on the subject. Again, as the 
book is statistical in nature, many important contributions to the probabilistic 
and stochastic aspects of extreme value theory have not been included. 

The book has been conceived as a graduate or advanced undergraduate course 
text where the instructor has the choice of including as much of the theoretical 
development as he or she desires. We expect the reader to be familiar with basic 
probability theory and statistics. A bit of knowledge about Poisson processes would 
also be helpful, especially in the chapters on multivariate extreme value theory and 
time series. We have attempted to make the book as self-contained as possible. Only 
well-known results from analysis, probability or statistics are used. Sometimes they 
are further explained in the technical details. 

Software that was used in the calculations is available at
http://www.wis.kuleuven.ac.be/stat/extreme.html#programs

We take pleasure in thanking first Daan de Waal, Chris Ferro and Bjorn Vandewalle
who have contributed substantially to the topics covered in the book. We are
also very grateful to Tertius de Wet, Elisabeth Joossens, Alec Stephenson and Bjorn
Vandewalle for furnishing additional material. A large number of colleagues provided
the data on which most of the practical applications and the case studies have
been based. We thank in particular Jef Caers, Johan Fereira, Philippe Delongueville 
(SECURA), Robert Verlaak (AON), Robert Oger and Viviane Planchon and Jozef 
Van Dyck (Probabilitas). 

We are very obliged to our universities as well. Especially, the members of 
and visitors from the University Center for Statistics, Catholic University, Leuven, 
deserve our great appreciation. We are particularly grateful for the constructive 
input from the graduate students who have followed courses partially based on 
material from the book. 

Our contact with J. Wiley through Sian Jones and Rob Calver has always been 
pleasant and constructive. 

Our main gratitude goes to our loving families, who have endured our preoccupation
with this ever-growing project for so long.

Jan BEIRLANT 
Yuri GOEGEBEUR 
Johan SEGERS 
Jozef L. TEUGELS 
WHY EXTREME VALUE THEORY?


1.1 A Simple Extreme Value Problem 

Many statistical tools are available in order to draw information concerning specific
measures in a statistical distribution. In this textbook, we focus on the behaviour of
the extreme values of a data set. Assume that the data are realizations of a sample
$X_1, X_2, \ldots, X_n$ of $n$ independent and identically distributed random variables. The
ordered data will then be denoted by $X_{1,n} \le \cdots \le X_{n,n}$. The sample data are
typically used to study properties of the distribution function

\[ F(x) = P(X \le x), \]

or of its inverse function, the quantile function defined as

\[ Q(p) := \inf\{x : F(x) \ge p\}. \]

Suppose we would like to examine the daily maximal wind speed data in 
the city of Albuquerque shown in Figure 1.1 (taken from Beirlant et al. (1996a)). 
In the classical theory, one is often interested in the behaviour of the mean or 
average. This average will then be described through the expected value E(X) of 
the distribution. On the basis of the law of large numbers, the sample mean $\bar{X}$
is used as a consistent estimator of $E(X)$. Furthermore, the central limit theorem
yields the asymptotic behaviour of the sample mean. This result can be used to
provide a confidence interval for $E(X)$ in case the sample size is sufficiently large, a
condition necessary when invoking the central limit theorem. For the Albuquerque
wind speed data, these techniques lead to an average maximum daily wind speed of
21.65 miles per hour, whereas (21.4, 21.9) is a 95% confidence interval for $E(X)$
based on the classical theory. 
where $x_{i,n}$ is the $i$-th ordered sample value. For the Albuquerque data, this leads
to $\hat{p} = 1 - \hat{F}_n(30) = 0.18$.

However, we should add some critical remarks to these considerations. What if 
the second moment $E(X^2)$ or even the mean $E(X)$ is not finite? Then the central
limit theorem does not apply and the classical theory, dominated by the normal
distribution, is no longer relevant. Or, what if one wants to estimate $p = P(X > x)$,
where $x > x_{n,n}$ and the estimate $\hat{p}$ defined above yields the value 0? Such questions
concerning the shed are important since the damage caused by extreme wind speeds
can be substantial, perhaps even catastrophic. Clearly, we cannot simply assume
that such $x$-values are impossible. However, the traditional technique based on the
empirical distribution function does not yield any useful information concerning
this type of question. In terms of the empirical quantile function

\[ \hat{Q}_n(p) := \inf\{x : \hat{F}_n(x) \ge p\}, \]

problems arise when we consider high quantiles $\hat{Q}_n(1 - p)$ with $p < 1/n$.

These observations show that it is necessary to develop special techniques 
that focus on the extreme values of a sample, on extremely high quantiles or on 
small tail probabilities. In practical situations, these extreme values are often of 
key interest. The wind speed example provides just one illustration but there are 
numerous other situations where extreme value reasoning is of prime importance. 
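The limitation described above can be made concrete with a minimal sketch of the empirical distribution and quantile functions. The helper names and the eight wind speeds are hypothetical, chosen only for illustration; they are not the Albuquerque data.

```python
import math

def empirical_cdf(sample, x):
    """F_n(x): fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def empirical_quantile(sample, p):
    """Q_n(p) = inf{x : F_n(x) >= p}, i.e. the order statistic x_{i,n}
    with (i - 1)/n < p <= i/n."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    ordered = sorted(sample)
    n = len(ordered)
    i = math.ceil(p * n - 1e-12)  # guard against floating-point noise in p * n
    return ordered[max(i, 1) - 1]

# Hypothetical wind speeds (illustrative values only).
speeds = [12.0, 18.5, 21.0, 22.5, 24.0, 27.5, 31.0, 35.5]

tail_inside = 1 - empirical_cdf(speeds, 30.0)  # level inside the observed range
tail_beyond = 1 - empirical_cdf(speeds, 40.0)  # level above the sample maximum
median = empirical_quantile(speeds, 0.5)
highest = empirical_quantile(speeds, 0.999)    # cannot exceed the sample maximum
```

Beyond the largest observation the empirical tail estimate is exactly zero, and every quantile level above 1 - 1/n returns the sample maximum, which is why purely empirical tools carry no information about more extreme levels.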

1.2 Graphical Tools for Data Analysis 

Given data, a practitioner wants to use graphics that will show in a clear and efficient
way the features of the data that are relevant for a given research question. In
this section, we concentrate on visually oriented statistical techniques that provide 
as much information as possible about the tail of a distribution. In later chapters, 
these graphical tools will help us to decide on a reasonable model to describe the 
underlying statistical population. Our emphasis will not be on global models that 
aim at describing the data in their entirety or the distribution on its full support. 
Rather, we perform statistical fits above certain (high) thresholds. Motivation for 
this was provided in Section 1.1. 

We will not recapitulate common statistical graphics such as histograms, smooth 
density estimates and boxplots. Instead we will focus on quantile-quantile (QQ) 
and mean excess (or mean residual life) plots, which are often more informative 
for our purposes. Moreover, many popular estimation methods from extreme value 
theory turn out to be directly based on these graphical tools. 

1.2.1 Quantile-quantile plots 

The idea of quantile plots, or more specifically quantile-quantile plots (QQ-plots
for short), has emerged from the observation that for important classes of distributions,
the quantiles $Q(p)$ are linearly related to the corresponding quantiles of
a standard example from this class. Linearity in a graph can be easily checked by
eye and can further be quantified by means of a correlation coefficient. This tool 
could therefore be ideally used when trying to answer the classical goodness-of-fit 
question: does a particular model provide a plausible fit to the distribution of the 
random variable at hand? 

Historically, the normal distribution has provided the prime class of models
where QQ-plots constitute a powerful tool in answering this question. As will be
shown in the sequel, the exponential distribution plays a far more important role
for our purposes. The rationale for QQ-plots remains the same but the calculations
are even easier. We start by explaining and illustrating the QQ-plot idea for the
exponential model Exp($\lambda$) (see Table 1.1). This very same methodology can then
be explored and extended in order to provide comparisons of empirical evidence
available in the data when fitting models such as the log-normal, Weibull or others.

Table 1.1 QQ-plot coordinates for some distributions.

Restricting our attention first to the Exp($\lambda$) model, we can propose the standard
exponential distribution

\[ 1 - F_1(x) := \exp(-x), \quad x > 0, \]

as the standard example from the class of distributions with general survival
function

\[ 1 - F_\lambda(x) = \exp(-\lambda x). \]

We want to know whether the real population distribution $F$ belongs to this class,
parametrized by $\lambda > 0$. The answer has to rely on the data $x_1, \ldots, x_n$ that we have
at our disposal. It is important to note that this parameter value can be considered
as a nuisance parameter here since its value is not our main point of interest at this
moment. We might even wonder whether this parameter has any relevance at all
for modelling reality since the whole parametric model itself is still in question.
The quantile function for the exponential distribution has the simple form

\[ Q_\lambda(p) = -\frac{1}{\lambda} \log(1 - p), \quad \text{for } p \in (0, 1). \]

Hence, there exists a simple linear relation between the quantiles of any exponential
distribution and the corresponding standard exponential quantiles

\[ Q_\lambda(p) = \frac{1}{\lambda} Q_1(p), \quad \text{for } p \in (0, 1). \]

Starting with a given set of data $x_1, x_2, \ldots, x_n$, the practitioner replaces the unknown
population quantile function $Q$ by the empirical approximation $\hat{Q}_n$ defined below.
In an orthogonal coordinate system, the points with values

\[ (-\log(1 - p), \hat{Q}_n(p)) \]

are plotted for several values of $p \in (0, 1)$. We then expect that a straight line
pattern will appear in the scatter plot if the exponential model provides a plausible
statistical fit for the given statistical population. When a straight line pattern is
obtained, the slope of a fitted line can be used as an estimate of the parameter $\lambda^{-1}$.
Indeed, if the model is correct, then the equation

\[ Q_\lambda(p) = \frac{1}{\lambda} (-\log(1 - p)) \]

holds. Note that the intercept for the given model should be 0 as $Q_\lambda(0) = 0$.
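In general,

\[ \hat{Q}_n(p) = x_{i,n}, \quad \text{for } \frac{i-1}{n} < p \le \frac{i}{n}. \]

A very practical choice of values of $p$ is given by

\[ p \in \left\{ \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}, 1 \right\}. \]

The alternative choice

\[ p \in \left\{ \frac{1 - 0.5}{n}, \frac{2 - 0.5}{n}, \ldots, \frac{n - 1 - 0.5}{n}, \frac{n - 0.5}{n} \right\} \]

applies a continuity correction in that we compare a discontinuous function $\hat{Q}_n$
with the continuous function $Q_1(x) := -\log(1 - x)$. Moreover, this choice avoids
overflow problems at $p = 1$. The same holds for the choice

\[ p \in \left\{ \frac{1}{n+1}, \frac{2}{n+1}, \ldots, \frac{n-1}{n+1}, \frac{n}{n+1} \right\}. \]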
While other choices are found in the literature, the latter option $p_{i,n} := i/(n + 1)$,
$i = 1, 2, \ldots, n$, will be used in the sequel. The empirical quantiles are plotted on
the vertical axis and the standard exponential quantiles on the horizontal axis.

A straight line can be fitted through the scatter plot using a classical least-squares
algorithm. This straight line with slope $a$ and intercept 0 is obtained from
minimizing the sum of squares

\[ \sum_{i=1}^{n} \left( x_{i,n} + a \log(1 - p_{i,n}) \right)^2. \]

This procedure yields the well-known formula for the least-squares fit

\[ \hat{a} = \frac{\sum_{i=1}^{n} x_{i,n} q_{i,n}}{\sum_{i=1}^{n} q_{i,n}^2}, \]

where we have put

\[ q_{i,n} := -\log(1 - p_{i,n}), \quad i = 1, 2, \ldots, n. \]

The fitted straight line can then be used as a tool to visually check the linear
structure of the scatter plot. Moreover, we get an estimate $1/\hat{a}$ of the parameter value
$\lambda$ if the linearity has been satisfactorily fulfilled.
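The construction of the exponential QQ-plot and the least-squares line through the origin can be sketched as follows. This is an illustrative sketch only: the function names, the seed and the simulated rate of 0.5 are assumptions made for the example, and the plotting positions i/(n + 1) are the ones adopted in the text.

```python
import math
import random

def exponential_qq_coordinates(sample):
    """Return (q, x) with q_{i,n} = -log(1 - i/(n+1)) and the ordered data x_{i,n}."""
    x = sorted(sample)
    n = len(x)
    q = [-math.log(1 - i / (n + 1)) for i in range(1, n + 1)]
    return q, x

def slope_through_origin(q, x):
    """Least-squares slope a minimising sum_i (x_i - a * q_i)^2."""
    return sum(xi * qi for xi, qi in zip(x, q)) / sum(qi * qi for qi in q)

random.seed(1)
true_rate = 0.5  # assumed value, used only to simulate data
data = [random.expovariate(true_rate) for _ in range(2000)]

q, x = exponential_qq_coordinates(data)
a_hat = slope_through_origin(q, x)  # estimates 1 / lambda
rate_hat = 1 / a_hat                # estimate of lambda
```

Plotting the pairs (q, x) with any graphics library gives the QQ-plot itself; the slope of the fitted line estimates the reciprocal of the rate, so the rate estimate is obtained by inversion.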

The exponential QQ-plot has a further important interpretation. The function
that is approximated when plotting

\[ (x_{i,n}, -\log(1 - p_{i,n})), \quad i = 1, \ldots, n, \]

is given by

\[ x \mapsto -\log(1 - F(x)). \]

This is precisely the function that maps a random variable $X$ with a continuous
distribution function $F$ into the standard exponential distribution. Indeed, the
distribution function of $-\log(1 - F(X))$ is given by

\[ P(-\log(1 - F(X)) \le x) = P(X \le Q(1 - \exp(-x))) = F(Q(1 - \exp(-x))) = 1 - \exp(-x), \]

and so $-\log(1 - F(X)) \sim \text{Exp}(1)$.
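As a small numerical check of this transformation, the sketch below applies it to the quantiles of a Pareto distribution with survival function x^(-alpha) for x >= 1. The choice of a Pareto model and the value alpha = 3 are assumptions made purely for illustration.

```python
import math

alpha = 3.0  # assumed Pareto index, for illustration only

def pareto_cdf(x):
    """F(x) = 1 - x^(-alpha) for x >= 1, and 0 below 1."""
    return 1 - x ** (-alpha) if x >= 1 else 0.0

def pareto_quantile(p):
    """Q(p) = (1 - p)^(-1/alpha)."""
    return (1 - p) ** (-1 / alpha)

def to_standard_exponential(x, cdf):
    """The map x -> -log(1 - F(x))."""
    return -math.log(1 - cdf(x))

levels = [0.1, 0.5, 0.9, 0.99]
transformed = [to_standard_exponential(pareto_quantile(p), pareto_cdf) for p in levels]
exponential_quantiles = [-math.log(1 - p) for p in levels]
```

The transformed Pareto quantiles coincide with the standard exponential quantiles at the same levels, which is the quantile version of the statement that the transformed variable is Exp(1) distributed.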

Often data are only available above a certain threshold $t$. For instance, a reinsurance
company might only receive information about claims larger than a priority $t$.
In case of an exponential model, this operation of conditioning on the event $(X > t)$
leads to a shifted exponential model. Evaluation of the distribution of data that are
larger than $t$ coincides with the conditional distribution of $X$, given $(X > t)$. Now,
in case of an exponential model,

\[ P(X > x \mid X > t) = \frac{P(X > x)}{P(X > t)} = \exp(-\lambda (x - t)), \quad x > t. \]

The corresponding quantile function is then equal to

\[ Q(p) = t - \frac{1}{\lambda} \log(1 - p), \quad 0 < p < 1. \]

As a result, the exponential QQ-plot introduced above will show an intercept $t$ at
the value $p = 0$.

Figure 1.2 Exponential QQ-plot: estimation of extreme quantiles.

Suppose that on the basis of an exponential QQ-plot, a global exponential
fit appears appropriate. Then we can answer a crucial question in extreme value
analysis that has been mentioned before: the estimate of an extreme quantile
$Q(1 - p)$ with $p$ small is given by

\[ \hat{q}_p = t - \frac{1}{\hat{\lambda}} \log(p). \]

Conversely, a small exceedance probability $p = P(X > x \mid X > t)$ will be estimated
by

\[ \hat{p}_x = \exp(-\hat{\lambda} (x - t)). \]

Here $\hat{\lambda}$ can be obtained from the least-squares regression on the exponential QQ-plot,
or just by taking the maximum likelihood estimator $\hat{\lambda} = 1/(\bar{x} - t)$. This is
illustrated in a graphical way in Figure 1.2.
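The two estimators above translate directly into code. The following is a minimal sketch using the maximum likelihood rate; the function names and the eight observations above the threshold are hypothetical and are not taken from any data set in the text.

```python
import math

def fit_shifted_exponential(exceedances, t):
    """Maximum likelihood rate for data above threshold t: 1 / (mean - t)."""
    mean = sum(exceedances) / len(exceedances)
    return 1.0 / (mean - t)

def extreme_quantile(p, t, rate):
    """q_p = t - log(p) / rate, the level exceeded with conditional probability p."""
    return t - math.log(p) / rate

def exceedance_probability(x, t, rate):
    """p_x = exp(-rate * (x - t)), an estimate of P(X > x | X > t)."""
    return math.exp(-rate * (x - t))

# Hypothetical observations above the threshold t = 82 (illustrative values only).
t = 82.0
obs = [83.0, 84.5, 86.0, 88.0, 90.5, 93.0, 97.0, 102.0]

rate = fit_shifted_exponential(obs, t)
q_001 = extreme_quantile(0.01, t, rate)          # lies beyond the largest observation
p_back = exceedance_probability(q_001, t, rate)  # recovers the level 0.01
```

Note that both quantities are conditional on exceeding the threshold; an unconditional tail probability would additionally require an estimate of P(X > t).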


Example 1.1 For a practical example of the above method, let us look at the daily
maximal wind speed measurements obtained in Zaventem, Belgium (Figure 1.3).
We restrict ourselves to data above $t = 82$ km/hr. We see that the histogram shows
an exponentially decreasing form. The global exponential quantile plot shows a
straight line pattern with intercept 82. A similar behaviour but with different values
of $\lambda$ has been shown to hold for most cities in Belgium.

Figure 1.3 Daily maximal wind speed measurements in Zaventem (Belgium)
from 1985 till 1992. (a) histogram for all data, (b) conditional histogram for wind
speeds larger than 82 km/hr with fitted exponential density superimposed ($\hat{\lambda} = 1/(\bar{x} - 82)$)
and (c) exponential QQ-plot for wind speeds larger than 82 km/hr.

The amount of evidence for the global goodness-of-fit can be measured by
means of the correlation coefficient

\[ r_Q = \frac{\sum_{i=1}^{n} (x_{i,n} - \bar{x})(q_{i,n} - \bar{q})}{\sqrt{\sum_{i=1}^{n} (x_{i,n} - \bar{x})^2 \sum_{i=1}^{n} (q_{i,n} - \bar{q})^2}}, \]

where

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i,n} \]

and where

\[ \bar{q} = \frac{1}{n} \sum_{i=1}^{n} q_{i,n}. \]

The quantity $r_Q$ always satisfies the inequality $0 \le r_Q \le 1$. Indeed, since the
$x_{i,n}$ and the $q_{i,n}$ are increasing, the correlation coefficient will be non-negative. Moreover,
$r_Q = 1$ if and only if all the points lie perfectly on a straight line. Therefore,
$r_Q$ can be used as a measure of global fit of the exponential model to the data. A
formal significance test can be based on the statistic $r_Q$ and rejects the hypothesis
of exponentiality when the number obtained differs too much from the value 1.
Equivalently, the value obtained can be compared with that of some tabulated
critical value.
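A short sketch of the correlation coefficient on the exponential QQ-plot follows. The function name and both synthetic samples are assumptions for illustration: the first sample consists of exact shifted exponential quantiles, the second of evenly spaced values whose tail is lighter than exponential.

```python
import math

def exponential_qq_correlation(sample):
    """Correlation r_Q between ordered data and standard exponential quantiles."""
    x = sorted(sample)
    n = len(x)
    q = [-math.log(1 - i / (n + 1)) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    q_bar = sum(q) / n
    s_xq = sum((xi - x_bar) * (qi - q_bar) for xi, qi in zip(x, q))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_qq = sum((qi - q_bar) ** 2 for qi in q)
    return s_xq / math.sqrt(s_xx * s_qq)

n = 50
# Synthetic sample lying exactly on a line in the exponential QQ-plot.
shifted_exponential = [82.0 - 8.5 * math.log(1 - i / (n + 1)) for i in range(1, n + 1)]
# Synthetic evenly spaced sample, which is not exponential.
evenly_spaced = [float(i) for i in range(1, n + 1)]

r_exact = exponential_qq_correlation(shifted_exponential)
r_uniform = exponential_qq_correlation(evenly_spaced)
```

The first value equals 1 up to rounding, while the second falls visibly below 1. Turning the statistic into a formal test still requires critical values, which this sketch does not provide.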

We summarize our findings for the case of an exponential QQ-plot as this will
help us in pinning down our objectives for the case of general QQ-plots. We denote by
$Q_s$ the quantile function of the standard distribution from a given parametric model.

In order to accept a proposed model as a plausible population model:

(i) start from a characterizing linear relationship between (an increasing function
of) the theoretical quantiles $Q(p)$ from the proposed distribution and the
computable quantiles $Q_s(p)$;

(ii) replace the theoretical quantiles $Q(p)$ by the corresponding empirical quantiles
$\hat{Q}_n(p)$;

(iii) plot the (increasing function of the) empirical quantiles $\hat{Q}_n(\frac{i}{n+1})$
against the corresponding specific quantiles $Q_s(\frac{i}{n+1})$;

(iv) inspect the linearity in the plot, for instance, by performing a linear regression
on the QQ-plot and by investigating the regression residuals and the
correlation coefficient.

Strong linearity implies a good fit. Quantiles and return periods can then be estimated
from the linear regression fit $y = \hat{b} + \hat{a}x$ on the QQ-plot:

\[ \hat{a} = \frac{\sum_{i=1}^{n} (x_{i,n} - \bar{x}) Q_s(p_{i,n})}{\sum_{i=1}^{n} (Q_s(p_{i,n}) - \bar{q})^2}, \qquad \hat{b} = \bar{x} - \hat{a}\bar{q}, \]

where $\bar{q} = \frac{1}{n} \sum_{i=1}^{n} Q_s(p_{i,n})$. Indeed, $\hat{q}_p = \hat{b} + \hat{a} Q_s(1 - p)$ can be used for the
estimation of the extreme quantiles. Further, $\hat{p}_x = \bar{F}_s((x - \hat{b})/\hat{a})$, with $\bar{F}_s = 1 - F_s$ and $F_s$ the
inverse function of $Q_s$, serves as an estimate for the exceedance probability.
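The regression recipe can be sketched for a general standard quantile function. The example below uses the standard normal distribution from Python's `statistics.NormalDist` as the standard model; the function name `qq_regression` and the synthetic sample (exact normal quantiles with location 10 and scale 2) are assumptions for illustration.

```python
from statistics import NormalDist

def qq_regression(sample, standard_quantile):
    """Least-squares slope and intercept of ordered data against Q_s(i/(n+1))."""
    x = sorted(sample)
    n = len(x)
    q = [standard_quantile(i / (n + 1)) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    q_bar = sum(q) / n
    a = sum((xi - x_bar) * qi for xi, qi in zip(x, q)) / sum((qi - q_bar) ** 2 for qi in q)
    b = x_bar - a * q_bar
    return a, b

std = NormalDist()  # standard normal, playing the role of the standard model

# Synthetic sample: exact quantiles of a normal law with location 10 and scale 2.
n = 99
data = [10.0 + 2.0 * std.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

a_hat, b_hat = qq_regression(data, std.inv_cdf)
p = 0.001
q_hat = b_hat + a_hat * std.inv_cdf(1 - p)        # extreme quantile estimate
p_hat = 1 - std.cdf((q_hat - b_hat) / a_hat)      # exceedance probability at q_hat
```

Here the intercept plays the role of location and the slope that of scale. The exceedance probability is computed with the survival function 1 - F_s, so that evaluating it at the estimated quantile returns the level p.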

QQ-plots can be used in cases more general than the exponential distribution
discussed above. In fact, they can be used to assess the fit of any statistical
model. Some other important cases are given below (see Table 1.1 for the QQ-plot
coordinates):

• The normal distribution. The coordinates of the points on a normal QQ-plot
follow immediately from the representation for normal quantiles

\[ Q(p) = \mu + \sigma \Phi^{-1}(p), \]

where $\Phi^{-1}$ denotes the standard normal quantile function.

• The log-normal distribution. Since log-transformed log-normal random variables
are normally distributed, log-normality can be assessed by a normal
QQ-plot of the log-transformed data

\[ (\Phi^{-1}(p_{i,n}), \log x_{i,n}), \quad i = 1, \ldots, n. \]
• The Pareto distribution. The coordinates of the points on a Pareto QQ-plot
follow immediately from the exponential case since a log-transformed Pareto
random variable is exponentially distributed.

• The Weibull distribution. The quantile function of the Weibull distribution
(cf. Table 1.1) is given by

\[ Q(p) = \left( -\frac{1}{\lambda} \log(1 - p) \right)^{1/\tau}, \]

or equivalently, after a log transformation, by

\[ \log Q(p) = \frac{1}{\tau} \log \frac{1}{\lambda} + \frac{1}{\tau} \log(-\log(1 - p)). \]

This then yields as coordinates for the Weibull QQ-plot

\[ (\log(-\log(1 - p_{i,n})), \log x_{i,n}), \quad i = 1, \ldots, n. \]
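For the Weibull case, the log-transformed quantile relation suggests reading both parameters off a fitted line: the slope estimates 1/tau and the intercept estimates (1/tau) log(1/lambda). The sketch below illustrates this on synthetic data; the function names and the parameter values lambda = 2 and tau = 1.5 are assumptions for the example.

```python
import math

def weibull_qq_coordinates(sample):
    """Points (log(-log(1 - i/(n+1))), log x_{i,n}) of the Weibull QQ-plot."""
    x = sorted(sample)
    n = len(x)
    return [(math.log(-math.log(1 - i / (n + 1))), math.log(xi))
            for i, xi in enumerate(x, start=1)]

def fit_line(points):
    """Ordinary least-squares slope and intercept."""
    n = len(points)
    u_bar = sum(u for u, _ in points) / n
    v_bar = sum(v for _, v in points) / n
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in points)
             / sum((u - u_bar) ** 2 for u, _ in points))
    return slope, v_bar - slope * u_bar

# Synthetic sample: exact Weibull quantiles with assumed parameters.
lam, tau = 2.0, 1.5
n = 200
sample = [(-math.log(1 - i / (n + 1)) / lam) ** (1 / tau) for i in range(1, n + 1)]

slope, intercept = fit_line(weibull_qq_coordinates(sample))
tau_hat = 1 / slope                        # slope estimates 1 / tau
lam_hat = math.exp(-intercept * tau_hat)   # intercept estimates (1 / tau) * log(1 / lambda)
```

Because the synthetic sample lies exactly on the theoretical line, the fit recovers the assumed parameters; with real data the same construction gives a visual check together with rough parameter estimates.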

Example 1.2 Our next example deals with the Norwegian fire insurance data
already treated in Beirlant et al. (1996a). Together with the year of occurrence,
we know the values (× 1000 Krone) of the claims for the period 1972-1992. A
priority of 500 units was in force. The time plot of all the claim values is given
in Figure 1.4(a). We will concentrate on the data from the year 1976 for which
Figure 1.4(b) shows a histogram. To assess the distributional properties for 1976,
we constructed exponential and Pareto QQ-plots, see Figures 1.4(c) and (d) respectively.
The points in the exponential QQ-plot bend upwards and exhibit a convex
pattern indicating that the claim size distribution has a heavier tail than expected
from an exponential distribution. Apart from the last few points, the Pareto QQ-plot
is more or less linear indicating a reasonable fit of the Pareto distribution to
the tail of the claim sizes. At the three largest observations the Pareto model does
not fit so well.

The distribution functions considered so far share the property that QQ-plots
can be constructed without knowledge of the correct model parameters. In fact,
parameter estimates can be obtained as a pleasant side result. This is particularly
true for location-scale models where the intercept of the line fitted to the QQ-plot
represents location while the slope represents scale. Unfortunately, this property
does not extend to all distributions. In such cases, the construction of QQ-plots
involves parameter estimation.

To deal with this more general case, consider a random variable $X$ with distribution
function $F_\theta$, where $\theta$ denotes the vector of model parameters. To evaluate the
fit of $F_\theta$ to a given sample $X_1, \ldots, X_n$ using QQ-plots, several possibilities exist.
A straightforward approach is to compare the ordered data with the corresponding
quantiles of the fitted distribution, i.e., plotting

\[ (Q_{\hat{\theta}}(p_{i,n}), X_{i,n}), \quad i = 1, \ldots, n, \]

where $Q_{\hat{\theta}}$ is the quantile function of $F_{\hat{\theta}}$ and $\hat{\theta}$ denotes an estimator for $\theta$ based on $X_1, \ldots, X_n$.




Alternatively, one can construct probability-probability or PP-plots which refer
to graphs of the type

\[ (F_{\hat{\theta}}(X_{i,n}), p_{i,n}), \quad i = 1, \ldots, n, \]

or

\[ (1 - F_{\hat{\theta}}(X_{i,n}), 1 - p_{i,n}), \quad i = 1, \ldots, n. \]

Figure 1.4 (a) Time plot for the Norwegian fire insurance data, (b) histogram,
(c) exponential QQ-plot and (d) Pareto QQ-plot for the 1976 data.

These plots rest on the probability integral transform: if $F_\theta$ is continuous and the
model is correct, then

\[ F_\theta(X_{i,n}) \stackrel{D}{=} U_{i,n}, \quad i = 1, \ldots, n, \]

where $U_{i,n}$, $i = 1, \ldots, n$, denotes the set of order statistics of a random sample of
size $n$ from the $U(0, 1)$ distribution and where $\stackrel{D}{=}$ denotes equality in distribution.

One can then return to the exponential framework by transforming the data first
to the exponential case followed by a subsequent assessment of the exponential
quantile fit. The quantities

\[ E_{i,n} = -\log(1 - F_\theta(X_{i,n})), \quad i = 1, \ldots, n, \]

are then the order statistics associated with a random sample of size $n$ from the
standard exponential distribution. Hence, another natural procedure to assess the
fit of $F_\theta$ to $X_1, \ldots, X_n$ is to construct

\[ (-\log(1 - p_{i,n}), -\log(1 - F_{\hat{\theta}}(X_{i,n}))), \quad i = 1, \ldots, n, \]

and to inspect the closeness of the points to the first diagonal. Such a plot is
sometimes referred to as a W-plot. Of course, in all these plots the coordinates can
be reversed.
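The PP-plot and W-plot coordinates can be computed with the same few lines for any fitted distribution function. The sketch below assumes an exponential model fitted by maximum likelihood to a synthetic sample; the function name, the sample and the rate 0.25 are hypothetical.

```python
import math

def pp_and_w_coordinates(sample, fitted_cdf):
    """Return PP-plot points (F(x_{i,n}), p_{i,n}) and W-plot points
    (-log(1 - p_{i,n}), -log(1 - F(x_{i,n}))) with p_{i,n} = i/(n+1)."""
    ordered = sorted(sample)
    n = len(ordered)
    pp_points, w_points = [], []
    for i, x in enumerate(ordered, start=1):
        p = i / (n + 1)
        u = fitted_cdf(x)
        pp_points.append((u, p))
        w_points.append((-math.log(1 - p), -math.log(1 - u)))
    return pp_points, w_points

# Synthetic sample: exact exponential quantiles with an assumed rate.
n = 100
true_rate = 0.25
sample = [-math.log(1 - i / (n + 1)) / true_rate for i in range(1, n + 1)]

# Fitted model: exponential with maximum likelihood rate n / sum(x).
rate_hat = len(sample) / sum(sample)

def fitted_cdf(x):
    return 1 - math.exp(-rate_hat * x)

pp_points, w_points = pp_and_w_coordinates(sample, fitted_cdf)
max_pp_gap = max(abs(u - p) for u, p in pp_points)  # distance to the first diagonal
```

For a well-fitting model both point clouds stay close to the first diagonal. The PP-plot emphasizes the centre of the distribution, whereas the logarithmic scale of the W-plot stretches the upper tail.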

1.2.2 Excess plots 

The probabilistic operation of conditioning a random variable X on the event 
(X > t) is of major importance in actuarial practice, especially in reinsurance. 
Take an excess-of-loss treaty with a retention t on any particular claim in the 
portfolio. The reinsurer has to pay a random amount X — t but only if X > t. 
When an actuary wants to decide on a priority level t through simulation, he needs 
to calculate the expected amount to be paid out per client when a given level t 
is chosen. This then is an important first step in deciding on the premium. For 
instance, the net premium principle depends on the mean claim size $E(X)$. For the
overshoot, the actuary will calculate the mean excess function or mean residual life
function $e$,

\[ e(t) = E(X - t \mid X > t), \]

assuming that for the proposed model, $E(X) < \infty$. In the whole of extreme value
methodology, it is natural to consider data above a specified high threshold.

In practice, the mean excess function $e$ is estimated by $\hat{e}_n$ on the basis of a
representative sample $x_1, \ldots, x_n$. Explicitly,

\[ \hat{e}_n(t) = \frac{\sum_{i=1}^{n} x_i 1_{(t,\infty)}(x_i)}{\sum_{i=1}^{n} 1_{(t,\infty)}(x_i)} - t, \]

where $1_{(t,\infty)}(x_i)$ equals 1 if $x_i > t$, and 0 otherwise. This expression is obtained
by replacing the theoretical average by its empirical counterpart, i.e., by averaging
the data that are larger than $t$ and subtracting $t$.

Often the empirical function $\hat{e}_n$ is plotted at the values $t = x_{n-k,n}$, $k = 1, \ldots, n - 1$,
the $(k + 1)$-largest observation. Then the numerator equals
$\sum_{i=1}^{n} x_i 1_{(t,\infty)}(x_i) = \sum_{j=1}^{k} x_{n-j+1,n}$, while the number of $x_i$ larger than $t$ equals $k$. The estimates
of the mean excesses are then given by

\[ e_{k,n} := \hat{e}_n(x_{n-k,n}) = \frac{1}{k} \sum_{j=1}^{k} x_{n-j+1,n} - x_{n-k,n}. \qquad (1.1) \]
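The empirical mean excess function and its values at the order statistics can be sketched as follows. The function names and the five data points are hypothetical, and the data are assumed to contain no ties so that exactly k observations exceed the (k + 1)-largest one.

```python
def mean_excess(sample, t):
    """Average of the observations larger than t, minus t."""
    exceedances = [x for x in sample if x > t]
    if not exceedances:
        raise ValueError("no observation exceeds t")
    return sum(exceedances) / len(exceedances) - t

def mean_excess_at_order_statistics(sample):
    """Return [(k, x_{n-k,n}, e_{k,n})] for k = 1, ..., n - 1."""
    x = sorted(sample)
    n = len(x)
    results = []
    running_sum = 0.0
    for k in range(1, n):
        running_sum += x[n - k]       # adds x_{n-k+1,n} (0-based index n - k)
        threshold = x[n - k - 1]      # x_{n-k,n}, the (k+1)-largest observation
        results.append((k, threshold, running_sum / k - threshold))
    return results

# Hypothetical data without ties (illustrative values only).
claims = [1, 2, 4, 7, 11]
table = mean_excess_at_order_statistics(claims)
```

Plotting the third component of each tuple against k, or against the threshold in the second component, gives the two versions of the mean excess plot.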


In this section, variations on the mean excess function and its empirical counterparts
will be examined from the viewpoint of their statistical applications. But
first we need to understand better the behaviour of the theoretical mean excess
function. This will help us link empirical shapes with specific theoretical models.
In Chapter 4, we will see how the statistic $\hat{e}_{n,\log X}(\log X_{n-k,n})$, with $\hat{e}_{n,\log X}$ denoting
the empirical mean excess function of the log-transformed data, appears as the Hill
estimator.

The calculation of $e$ for a random variable with survival function $1 - F$ starts
from the formula

\[ e(t) = \frac{\int_t^{x_+} (1 - F(u)) \, du}{1 - F(t)}, \qquad (1.2) \]

where $x_+ = \sup\{x : F(x) < 1\}$ is the right endpoint of the support of $F$. The
derivation of this alternative formula goes as follows. Apply Fubini's theorem to
write

\[ \int_t^{x_+} (x - t) \, dF(x) = \int_t^{x_+} \int_t^{x} dy \, dF(x) = \int_t^{x_+} \int_y^{x_+} dF(x) \, dy = \int_t^{x_+} (1 - F(y)) \, dy. \qquad (1.3) \]

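One can also derive an inverse relationship, indicating how one calculates $F$
from $e$. This then shows that $e$ uniquely determines $F$. Indeed, from relation (1.2)

\begin{align*}
\int_0^t \frac{1}{e(u)} \, du
  &= \int_0^t \frac{1 - F(u)}{\int_u^{x_+} (1 - F(v)) \, dv} \, du \\
  &= -\int_0^t \frac{d}{du} \log \int_u^{x_+} (1 - F(v)) \, dv \, du \\
  &= \log \int_0^{x_+} (1 - F(v)) \, dv - \log \int_t^{x_+} (1 - F(v)) \, dv \\
  &= \log(e(0)(1 - F(0))) - \log(e(t)(1 - F(t))),
\end{align*}

so that

\[ \frac{1 - F(t)}{1 - F(0)} = \frac{e(0)}{e(t)} \exp\left( -\int_0^t \frac{1}{e(u)} \, du \right). \]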
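As a worked illustration of the inverse relationship, consider two simple shapes of $e$. The constants $c$, $a$ and $b$ below are introduced only for this example and do not appear in the text.

```latex
% Constant mean excess, e(u) = c > 0:
\[
  \frac{1 - F(t)}{1 - F(0)} = \frac{c}{c} \exp\left( -\int_0^t \frac{du}{c} \right)
  = \exp\left( -\frac{t}{c} \right),
\]
% an exponential tail with rate 1/c.

% Linearly increasing mean excess, e(u) = a + b u with a > 0, b > 0:
\[
  \int_0^t \frac{du}{a + b u} = \frac{1}{b} \log\left( 1 + \frac{b t}{a} \right),
  \qquad
  \frac{1 - F(t)}{1 - F(0)}
  = \frac{a}{a + b t} \left( 1 + \frac{b t}{a} \right)^{-1/b}
  = \left( 1 + \frac{b t}{a} \right)^{-1 - 1/b},
\]
% a tail that decays like a power of t.
```

A constant mean excess function thus corresponds to an exponential tail, while a linearly increasing one corresponds to a power-law (Pareto-type) tail.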

When considering the shapes of mean excess functions, again the exponential
distribution plays a central role. A characteristic feature of the exponential distribution
is its memoryless property, meaning that whether the information $X > t$
is given or not, the outcome for the average value of $X - t$ is the same as if
one started at $t = 0$ and calculated $E(X)$. The mean excess function $e$ for the
exponential distribution is constant and given by

\[ e(t) = \frac{1}{\lambda} \quad \text{for all } t > 0. \]

When the distribution of $X$ has a heavier tail than the exponential distribution
(HTE), then we find that the mean excess function ultimately increases, while for
lighter tails (LTE) $e$ ultimately decreases. For example, the Weibull distribution
with $1 - F(x) = \exp(-\lambda x^\tau)$ satisfies the asymptotic expression

\[ e(t) = \frac{t^{1-\tau}}{\lambda \tau} (1 + o(1)), \quad t \to \infty, \]

yielding an ultimately decreasing (resp. increasing) $e$ in case $\tau > 1$ (resp. $\tau < 1$).
Hence, the shape of $e$ yields important information on the LTE or HTE nature of
the tail of the distribution at hand. The graphs of $e$ for some well-known distributions
are sketched in Figure 1.5.
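The Weibull behaviour can be checked numerically by integrating the survival function as in (1.2). The sketch below uses a plain trapezoidal rule; the function name, step size, tolerance and parameter values are assumptions chosen for illustration.

```python
import math

def weibull_mean_excess(t, lam, tau, steps_per_scale=2000, tol=1e-12):
    """e(t) = int_0^inf exp(-lam * ((t + s)**tau - t**tau)) ds for t > 0,
    approximated by the trapezoidal rule until the integrand drops below tol."""
    h = (t ** (1 - tau) / (lam * tau)) / steps_per_scale  # step tied to the asymptotic scale
    total, s, prev = 0.0, 0.0, 1.0
    while prev > tol:
        s += h
        cur = math.exp(-lam * ((t + s) ** tau - t ** tau))
        total += 0.5 * h * (prev + cur)
        prev = cur
    return total

# Heavier tail than exponential: tau = 1/2, lam = 1. For this particular case the
# substitution y = sqrt(t + s) - sqrt(t) gives e(t) = 2 * sqrt(t) + 2 exactly.
e_heavy_25 = weibull_mean_excess(25.0, 1.0, 0.5)
e_heavy_100 = weibull_mean_excess(100.0, 1.0, 0.5)

# Lighter tail than exponential: tau = 2, lam = 1.
e_light_2 = weibull_mean_excess(2.0, 1.0, 2.0)
e_light_5 = weibull_mean_excess(5.0, 1.0, 2.0)
asymptotic_5 = 5.0 ** (1 - 2.0) / (1.0 * 2.0)  # t^(1 - tau) / (lam * tau)
```

In the heavy-tailed case the mean excess grows with the threshold, in the light-tailed case it shrinks, and for large thresholds the numerical value approaches the asymptotic expression.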

Plots of empirical mean excess values $e_{k,n}$ as introduced in (1.1) can be constructed
in two alternative ways, i.e., $e_{k,n}$ versus $k$, or $e_{k,n}$ versus $x_{n-k,n}$. Remembering
the discussion in the preceding subsection, one looks for the behaviour of
the plotted $e_{k,n}$ values for decreasing $k$ values or for increasing $x_{n-k,n}$ values. In
case of the wind speed data from Zaventem, the constant behaviour becomes apparent
from the plot in Figure 1.6. On the other hand, data from the 1976 Norwegian
fire insurance example show a HTE pattern (see Figure 1.7).



Figure 1.5 Shapes of some mean excess functions.

Apart from being an estimate of the function $e$ at a specific value $x_{n-k,n}$, the
quantity $E_{k,n}$ can also be interpreted as an estimate of the slope of the exponential
QQ-plot to the right of a reference point with coordinates $(-\log((k+1)/(n+1)),\ x_{n-k,n})$.
Here, we make use of the continuity correction in the QQ-plot. Indeed, this slope
can be estimated by the ratio of the differences in the vertical and horizontal
coordinates between the remaining points and the reference point itself:

$$\frac{\frac{1}{k}\sum_{j=1}^{k} X_{n-j+1,n} - X_{n-k,n}}{-\frac{1}{k}\sum_{j=1}^{k} \log\frac{j}{n+1} + \log\frac{k+1}{n+1}}.$$

In this expression, the denominator can be interpreted as an estimator of the mean
excess function of the type $E_{k,n}$, taken at $-\log((k+1)/(n+1))$ and based on the standard
exponential (theoretical) quantiles $-\log(1 - p)$ with $p = 1 - j/(n+1)$, $j = 1, \ldots, k$.
The denominator hence is an approximation of the mean excess function of the
standard exponential distribution Exp(1) as in (1.1) and hence is approximately
equal to 1. Using Stirling's formula, one can even verify that this is a very precise
approximation even for small values of $k$. Hence we find that $E_{k,n}$ constitutes an
approximation of this slope estimate.
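
The two ingredients of the ratio above can be computed directly. The sketch below assumes that the empirical mean excess value is the average of the $k$ largest observations minus $x_{n-k,n}$, and evaluates the denominator $-\frac{1}{k}\sum_{j=1}^{k}\log\frac{j}{n+1} + \log\frac{k+1}{n+1}$ for a few values of $k$ so that its closeness to 1 can be inspected. The simulated exponential sample, the seed and the function names are hypothetical choices for this illustration.

```python
import math
import random

def mean_excess(sorted_x, k):
    """Average of the k largest observations minus x_{n-k,n} (ascending input)."""
    n = len(sorted_x)
    threshold = sorted_x[n - k - 1]          # x_{n-k,n} in 0-based indexing
    return sum(sorted_x[n - k:]) / k - threshold

def qq_denominator(n, k):
    """Average horizontal distance to the reference point in the exponential QQ-plot."""
    top = sum(-math.log(j / (n + 1)) for j in range(1, k + 1)) / k
    return top + math.log((k + 1) / (n + 1))

random.seed(1)
lam = 2.0                                     # assumed rate of the simulated sample
data = sorted(random.expovariate(lam) for _ in range(5000))
for k in (10, 100, 500):
    print(k, mean_excess(data, k), qq_denominator(len(data), k))
```

For exponential data the mean excess values should fluctuate around $1/\lambda$, in line with the constant mean excess function of that distribution.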

The above discussion explains the ultimately increasing (respectively, decreasing)
behaviour of the mean excess function for HTE distributions (respectively,
LTE distributions). In case of a HTE distribution, the exponential QQ-plot has a
convex shape for the larger observations and the slopes continue to increase near
the higher observations. This then leads to an increasing mean excess function. A
converse reasoning holds for LTE distributions. Illustrations for this principle are
sketched in Figure 1.8.


1.3 Domains of Applications 

1.3.1 Hydrology 

The ultimate interest in flood frequency analysis is the estimation of the $T$-year
flood discharge (water level), which is the level exceeded every $T$ years on average.
Here, a high quantile of the distribution of discharges is sought. Usually, a time
span of 100 years is taken, but the estimation is mostly carried out on the basis
of flood discharges for a shorter period. Consequences of floods exceeding such
a level can be disastrous. For example, the 100-year flood levels were exceeded
by the American flood of 1993 and caused widespread devastation in the states in
the Mid-West. The floods in the Netherlands in 1953 were really catastrophic and
triggered the Delta Plan of dike constructions, still of interest in that country. By
law, dikes in the low-land countries, Belgium and the Netherlands, should be built
as high as the $10^4$-year flood discharge.

Another hydrological parameter for which the tail of the corresponding
distribution is of special interest is rainfall intensity. This parameter is important in
modelling water course systems, urban drainage and water runoff. Clearly, the
effective capacity of such systems is determined by the most extreme intensities.
Quite often only periodic, even annual, maxima are available. Then, as an alternative
to $T$-year water levels, the conclusions of an extreme value analysis are requested
in terms of a return period. The latter is expressed in terms of the reciprocal of
the survival function of the periodic maxima, say $Y$,

$$T(x) = \frac{1}{P(Y > x)}.$$

Later, we will describe how the concept of return period can easily be adapted to
cases where the distribution of values of Y is studied.
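
A purely empirical version of the return period can be sketched as follows, replacing $P(Y > x)$ by the observed fraction of exceedances. The function names and the list of annual maxima are hypothetical and serve only as an illustration. An empirical estimate of this kind cannot extrapolate beyond the observed range, which is where the extreme value models of the later chapters come in.

```python
def return_period(annual_maxima, x):
    """Empirical T(x) = 1 / P(Y > x); infinite if x was never exceeded."""
    exceedances = sum(1 for y in annual_maxima if y > x)
    if exceedances == 0:
        return float("inf")
    return len(annual_maxima) / exceedances

def t_year_level(annual_maxima, T):
    """Smallest observed level whose empirical return period is at least T."""
    for level in sorted(annual_maxima):
        if return_period(annual_maxima, level) >= T:
            return level

# Hypothetical annual maximal discharges, not the Meuse data.
maxima = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(return_period(maxima, 80))
print(t_year_level(maxima, 5))
```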


Case study: 

Annual maximal river discharges of the Meuse river from 1911 till 1996 at 
Borgharen in Holland. The time plot of the annual maximal discharges is given 
in Figure 1.9(a). In order to get an idea about the tail behaviour of the annual 
maxima distribution, an exponential QQ-plot was constructed, see Figure 1.9(b). 
As is clear from this QQ-plot, the annual maxima distribution does not exhibit 
a HTE tail behaviour. This is also confirmed by the mean excess plots given in 
Figures 1.9(c) and (d). 

1.3.2 Environmental research and meteorology 

Meteorological data generally have no alarming aspects as long as they are situated 
in a narrow band around the average. The situation changes for instance when 
concentrations occur that overshoot a specific ecological threshold like with ozone 
concentration. Rainfall and wind data provide other illustrations with tremendous 
impact on society as they are among the most common themes for discussion. Just 
recall the questions concerning global warming and climate change. Typically, one 
is interested in the analysis of maximal and minimal observations and records over 
time (often attributed to global warming) since these entail the negative
consequences.

Case studies: 

(i) Wind speed database provided by NIST, Gaithersburg, consisting of daily 
fastest-mile speeds measured by anemometers situated 10 m above the 
ground. The data have been filed for a period of 15 to 26 years and concern 
49 airports in the US over a period between 1965 and 1992. Wind speeds from 
hurricanes and tornadoes have not been incorporated. We select three cities 
from the study. Table 1.2 gives the length of the data and their observation 
periods. Figure 1.10 represents the corresponding boxplots. 


(ii) Daily maximum temperatures at Uccle, Belgium. The data plotted in
Figure 1.11 are daily maximum surface air temperatures recorded in degrees
Celsius at Uccle, Belgium. These data were gathered as part of the European
Climate Assessment and Dataset project (Klein Tank and co-authors (2002))
and are freely available at www.knmi.nl/samenw/eca.

Table 1.2 Wind speed database.

City           State        Length   Period
Albuquerque    New Mexico   6939     1965-'83
Des Moines     Iowa         5478     1965-'79
Grand Rapids   Michigan     5478     1965-'79

Figure 1.11 Time plot of the daily maximum temperature at Uccle, Belgium.

1.3.3 Insurance applications 

One of the most prominent applications of extreme value thinking can be found
in non-life insurance. Some portfolios seem to have a tendency to occasionally
include a large claim that jeopardizes the solvency of a portfolio or even of a
substantial part of the company. Apart from major accidents such as earthquakes,
hurricanes, airplane accidents and so on, there is a vast number of occasions where
large claims occur. Once in a while, automobile insurance leads to excessive claims.
More often, fire portfolios encounter large claims. Industrial fires, especially, cause
a lot of side effects in loss of property, temporary unemployment and lost contracts.

Figure 1.10 Boxplots of the daily fastest-mile wind speeds (Albuquerque, Des Moines, Grand Rapids).

An insurance company will always safeguard itself against portfolio contamination
caused by claims that should be considered as extreme rather than average.
In an excess-of-loss reinsurance contract, the reinsurer pays for the claim amount 
in excess of a given retention. The claim distribution is therefore truncated to the 
right, at least from the viewpoint of the ceding company. The estimation of the 
upper tail of the claim size distribution is of major interest in order to determine 
the net premium of a reinsurance contract. Several new directions in extreme value 
theory were influenced by methods developed in the actuarial literature. 

Case studies: 


(i) The Secura Belgian Re data set depicted in Figure 1.12(a) contains 371
automobile claims from 1988 till 2001 gathered from several European insurance
companies, which are at least as large as 1,200,000 Euro. These data were
corrected for, among other things, inflation. The ultimate goal is to provide the
participating reinsurance companies with an objective statistical analysis in
order to assist in pricing the unlimited excess-loss layer above an operational
priority R. These data will be studied in detail in Chapter 6; here we use them
only to illustrate some concepts introduced above. The exponential QQ-plot of
the claim sizes is given in Figure 1.12(b). From this plot, a point of inflection
with different slopes to the left and the right can be detected. This becomes
even more apparent in the mean excess plots given in Figures 1.12(c) and
(d): beyond 2,500,000, the rather horizontal behaviour changes into a
positive slope. As we will see later, the mean excess function is an important
ingredient for establishing the net premium of a reinsurance contract.


(ii) The SOA Group Medical Insurance Large Claims Database. This database
records, among others, all the claim amounts exceeding 25,000 USD over the
period 1991-92 and is available at http://www.soa.org. There is no truncation
due to maximum benefits. The study conducted by Grazier and G'Sell
Associates (1997), where a thorough description of these data can be found,
collects information from 26 insurers. The 171,000 claims recorded are part
of a database including about 3 million claims over the years 1991-92. Here
we deal with the 1991 data. The histogram of the log-claim amounts shown
in Figure 1.13(a) gives evidence of a considerable right-skewness. Further,
the convex shape of the exponential quantile plot (Figure 1.13(b)) and the
increasing behaviour of the mean excess plots (Figures 1.13(c) and (d)) in
the largest observations indicate a HTE nature of the claim size distribution.

For reinsurers, the possible influence of covariate information like the sum
insured and the type of building is of prime importance for premium
differentiation according to the risk involved.

(iv) Loss-ALAE data studied by Frees and Valdez (1998) and Klugman and Parsa
(1999). The data shown in Figure 1.15 comprise 1,500 general liability claims
(expressed in USD) randomly chosen from late settlement lags and were
provided by Insurance Services Office, Inc. Each claim consists of an indemnity
payment (loss) and an allocated loss adjustment expense (ALAE). Here, ALAE
are types of insurance company expenses that are specifically attributable to
the settlement of individual claims such as lawyers' fees and claims
investigation expenses. In order to price an excess-of-loss reinsurance treaty when
the reinsurer shares the claim settlement costs, the dependence between losses
and ALAE's has to be accounted for. Our objective is to describe the extremal
dependence.

Figure 1.15 Loss-ALAE data: scatterplot of loss vs ALAE.

1.3.4 Finance applications

Financial time-series consist of speculative prices of assets such as stocks, foreign
currencies or commodities. Risk management at a commercial bank is intended to
guard against risks of loss due to a fall in prices of financial assets held or issued
by the bank. It turns out that returns, that is, the relative differences of consecutive
prices or differences of log-prices, are the appropriate quantities to be investigated;
see for instance, Figures 1.16(a) and (b), where we show the time plot of the
Standard & Poors 500 closing values and daily percentage returns respectively,
from January 1960 up to 16 October 1987, the last market day before the big
crash of Black Monday, 19 October 1987. For more about financial applications,
we refer to the book by Embrechts et al. (1997).

The Value-at-Risk (VaR) of a portfolio is essentially the level below which the 
future portfolio will drop with only a small probability. VaR is one of the important 
risk measures that have been used by investors or fund managers in an attempt to 
assess or predict the impact of unfavourable events that may be worse than what 
has been observed during the period for which relevant data are available. 

1.3.5 Geology and seismic analysis 

Applications of extreme value statistics in geology can be found in the magnitudes 
of and losses from earthquakes, in diamond sizes and values, in impact crater size 
distributions on terrestrial planets, and so on; see for instance Caers et al. (1999a, 
1999b). The importance of tail characteristics of such data can be linked to the 
interpretation of the underlying geological process. 

Case studies: 

(i) Pisarenko and Sornette (2003) analysed shallow earthquakes (depth < 70 km)
in the Harvard catalog over the period 1977-2000. In this study, the tails of
the seismic moment distributions for subduction and mid-ocean ridge zones
are compared. The database contains seismic moment measurements (in
dyne-cm) of 6458 earthquakes in subduction zones and 1665 earthquakes in
mid-ocean ridge zones. For both zones, the seismic moment distributions are of
HTE type as indicated by the exponential quantile and mean excess plots
given in Figure 1.17.

(ii) In agriculture, soil analysis is the basis of fertilizer and amendment
recommendations in the context of managing soil fertility and crop performance.
Fertilizers are used to meet crop demand for nutrients while amendments
are necessary to stabilize and improve both soil structure and water
infiltration, and to optimize pH levels. Recently, a new concept of crop management,
called precision farming, has emerged. It permits within-field variation of crop
techniques, for instance, to adjust fertilizer inputs on the basis of soil
sampling and soil analysis. As the development of these techniques increased the
demand for soil data, laboratories are now burdened with large datasets. In
this context, the Belgian non-profit organization REQUASUD (Réseau Qualité
Sud, i.e. South Quality Network) was created in 1989 to put efficient advice
and analysis services at the practitioner's disposal. REQUASUD developed
a centralized soil database that contains more than 150,000 soil chemical
composition (pHKCl, K, Mg, Ca, etc.) records. It also has information about
sample origin (zip code), soil texture, soil occupation, previous and recent
cultures. The Unit of Geopedology (Gembloux Agricultural University, Belgium)
is the reference laboratory for soil analyses and the database is centralized at
the Unit of Biometry, Data Management and Agrometeorology (Agricultural
Research Centre of Gembloux). Detailed studies of the data allow extension
services to study physical and chemical properties of agricultural soils and to
manage them according to their fertility potential and their ability to support
cultures.

The Condroz database contains calcium content and pH level measurements
of 19,516 soil samples originating from different cities in the Condroz, a
geographical region in the southern part of Belgium. Figure 1.18 shows a map
of Belgium in which the area covered by the data is grey coloured. For a
detailed description of these data, we refer to Goegebeur et al. (2004). The
data have been analysed with emphasis on the development of an automatic
procedure for highlighting suspicious calcium measurements in order to
guarantee database quality. Our focus will be on the related issue of modelling
extreme calcium measurements in terms of the covariates pH level and city. As
is clear from the calcium versus pH scatter plot given in Figure 1.19(a), both
variables are positively associated. Moreover, note that extreme calcium
measurements tend to occur more often at the higher pH levels, indicating the need
for describing the tail of the calcium distribution in terms of the covariate pH.
Here, we comment on the tail behaviour of the calcium content distribution
conditional on pH = 6.5. The convex shape of the exponential QQ-plot and
(hence) the increasing mean excess function near the largest observations
give evidence of a HTE-type tail behaviour, see Figures 1.19(b), (c) and (d).

(iii) Diamond data. The profitability of a diamond exploration heavily depends on
the quality of the stones found in a particular area. In turn, the overall value

Figure 1.20 Diamond data.

1.3.6 Metallurgy 

An important problem from the area of metallurgy that received wide attention 
is the estimation of the size of the largest inclusions in a metal as metal fatigue 
typically originates at very large inclusions. See, for instance, the special issue 
of Extremes, 1999, dedicated to this subject (Bomas et al. 1999, Murakami and 
Beretta 1999, Svensson and de Mare 1999). Here an interesting connection exists
with Wicksell's corpuscle problem when only the sizes of vertical sections of
such 'grains' are measured. Wicksell (1925) gave the integral relation between the
distribution of sizes of spherical objects and the distribution of vertical sections.

Applying a cyclic loading on a metallic component may cause its failure even
if the maximum stress is below the static strength limit of the material. This
phenomenon is termed fatigue. Any material has a minimum stress range, called
the fatigue strength, below which it can endure an indefinite number of cycles.
However, fatigue properties of steel are strongly influenced by the presence of
microscopic particles of oxides or foreign material known as inclusions. Fatigue
strength increases with decreasing defect size and therefore, the size of the
maximum inclusion is an important indicator of the quality of a particular metallic
component. It is infeasible to completely destroy a component in order to find
its largest inclusion. Inference about this inclusion has to be based on a
representative sample. In general, observations are taken on polished plane surfaces of
samples of steel resulting in sizes of two-dimensional cross-sections of those
inclusions that intersect the surface. This raises the additional problem of inferring
the three-dimensional sizes of large inclusions from data in two-dimensional surface
sections. For recent contributions, we refer to Drees and Reiss (1992), Takahashi
and Sibuya (1996), Takahashi and Sibuya (1998), Anderson and Coles (2000) and
Beretta and Anderson (2002). As an example we use the data from Anderson and
Coles (2000). The data contain 112 surface diameters of inclusions on a polished
surface above the threshold of 5 μm. The units of measurement are taken to ensure
that the area of the measured surface would be 1. In Figure 1.21(a), we show
the histogram of the surface diameters with a fitted exponential density function
($\hat{\lambda} = 1/(\bar{x} - 5) = 1/1.548152$) superimposed. The fit of the exponential
distribution can be further evaluated on the basis of the exponential QQ-plot given in
Figure 1.21(b) where the straight line shows the least-squares fit.
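
The fitted rate quoted above, one over the mean excess above the threshold, together with the coordinates of an exponential QQ-plot, can be sketched as follows. The diameters in `sample` are hypothetical values invented for this illustration, not the Anderson and Coles (2000) data, and the function names and the plotting positions $i/(n+1)$ are assumptions of the sketch.

```python
import math

def fit_exponential_rate(diameters, threshold):
    """Rate estimate 1 / (mean - threshold), from the excesses over the threshold."""
    excesses = [d - threshold for d in diameters if d > threshold]
    return len(excesses) / sum(excesses)

def exponential_qq_points(diameters):
    """Pairs (-log(1 - i/(n+1)), x_{i,n}) for i = 1, ..., n."""
    ordered = sorted(diameters)
    n = len(ordered)
    return [(-math.log(1.0 - i / (n + 1)), x) for i, x in enumerate(ordered, start=1)]

# Hypothetical diameters in micrometres (illustration only).
sample = [5.3, 5.9, 6.1, 6.4, 7.2, 8.0, 9.5]
print("fitted rate:", fit_exponential_rate(sample, threshold=5.0))
for quantile, diameter in exponential_qq_points(sample):
    print(round(quantile, 3), diameter)
```

If the exponential model is adequate, the printed pairs should lie close to a straight line whose slope is roughly the reciprocal of the fitted rate.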

Another important problem from metallurgy is the study of pit corrosion.
Corrosion can lead to the failure of metal structures such as tanks or tubes. Extreme
value analysis becomes relevant since pits of large depth are of primary interest.

1.3.7 Miscellaneous applications 

Network traffic data exhibit properties that are inconsistent with traditional
queueing models. In fact, next to several other unusual properties, the distribution of
quantities such as transmission lengths, transmission rates, file sizes and CPU
job completion times can be well modelled by Pareto-type laws which will be
discussed in Chapter 4. Some references include Guerin et al. (2000), Resnick (1997),
Resnick and Rootzen (2000).

Recently, models for old age mortality data received renewed attention. In fact, 
there is a debate on whether or not there is a fixed upper limit to the length of 
human life (see Thatcher (1999)). 

We can also refer to Zipf's (1941, 1949) classic study of the dynamics of
community sizes where Pareto-type distributions are found again, see, for instance,
Feuerverger and Hall (1999). In fact, Pareto laws were also observed for the
distribution of biological genera, ranked by the number of species they contain (Willis
(1922)), and for the distribution of word usage frequencies in numerous linguistic
and literary contexts (Zipf (1935)). The book of Zipf (1941, 1949) presents an
incredible variety of phenomena, including examples from economics, business,
commerce, economical geography, industry, travel, communication, traffic,
sociology, psychology, music, politics and warfare. In many of these examples, an
approximate fit of a Pareto distribution is rather convincing, particularly in the tail.

1.4 Conclusion 

As a conclusion, we find that the area of extreme value statistics, as statistics in
general, offers a wide variety of problems. Aside from the classical problem of
analysing the distribution of a single random variable on the basis of a random
sample, we find data structures for which time-series models, regression and
multivariate settings are appropriate. Besides parametric and non-parametric approaches in
a frequentist setting, Bayesian parameter estimation techniques are also now in
use. The goal of this text is to provide an introduction to each of those models and
methods. To reach this goal, we provide in Part I the basic theoretical probabilistic
and statistical background. In addition, we elaborate on some case studies from
the list above.



2 THE PROBABILISTIC SIDE OF EXTREME VALUE THEORY


Consider a random sample $\{X_i, 1 \le i \le n\}$ from a distribution $F$. In the preceding
chapter, it was mentioned that in many situations, extreme value analysis is (to
be) built on a sequence of data that are block maxima, for instance, yearly maxima.
A traditional statistical discussion on the mean is based on the central limit theorem
and hence often returns to the normal distribution as a basis for statistical inference.
The classical central limit theorem states that the distribution of

$$\sqrt{n}\, \frac{(X_1 + \cdots + X_n)/n - E(X)}{\sqrt{\operatorname{var}(X)}} = \frac{X_1 + \cdots + X_n - nE(X)}{\sqrt{n \operatorname{var}(X)}}$$

converges for $n \to \infty$ to a standard normal distribution. In general, the central
limit problem deals with the sum $S_n := X_1 + X_2 + \cdots + X_n$ and tries to find
constants $a_n > 0$ and $b_n$ such that $Y_n := a_n^{-1}(S_n - b_n)$ tends in distribution to a
non-degenerate distribution. Once the limit is known, it can be used to approximate
the otherwise cumbersome distribution of the quantity $Y_n$.

A first question is to determine what distributions can appear in the limit. 
Then comes the question for which F any such limit is attained. The answer 
reveals that typically the normal distribution is attained as a limit for this sum (or
average) $S_n$ of independent and identically distributed random variables, except
when the underlying distribution F possesses a too heavy tail; in the latter case, 
a stable distribution appears as a limit. Specifically, Pareto-type distributions F 
with infinite variance will yield non-normal limits for the average: the extremes 
produced by such a sample will corrupt the average so that an asymptotic behaviour 
different from the normal behaviour is obtained. 

In this chapter, we will be mainly concerned with the corresponding problem
for the sample maximum rather than the average: we will consider both the possible
limits and the different ways to describe the sets of distributions from which sample
maxima are converging to these limits.

2.1 The Possible Limits 


In what follows, we will replace the sum $S_n$ by the maximum

$$X_{n,n} = \max\{X_1, X_2, \ldots, X_n\}.$$

Of course, we could just as well study the minimum rather than the maximum.
Clearly, results for one of the two can be immediately transferred to the other
through the relation

$$X_{1,n} = -\max\{-X_1, -X_2, \ldots, -X_n\}.$$

It is natural to consider the probabilistic problem of finding the possible limit
distributions of the maximum $X_{n,n}$. Hence, the main mathematical problem posed in
extreme value theory concerns the search for distributions of $X$ for which there exist
a sequence of numbers $\{b_n; n \ge 1\}$ and a sequence of positive numbers $\{a_n; n \ge 1\}$
such that for all real values $x$ (at which the limit is continuous)

$$P\left(\frac{X_{n,n} - b_n}{a_n} \le x\right) \to G(x) \tag{2.1}$$

as $n \to \infty$. The standardization with $b_n$ and $a_n$ appears natural since otherwise
$X_{n,n} \to x^*$ a.s., with $x^*$ the right endpoint of $F$. It is required that the limit $G$
should be a non-degenerate distribution; in fact any number can appear as the
degenerate limit of $(X_{n,n} - b_n)/a_n$
whatever the underlying distribution. Again, the problem is twofold: (i) find all
possible (non-degenerate) distributions $G$ that can appear as a limit in (2.1);
(ii) characterize the distributions $F$ for which there exist sequences $\{a_n; n \ge 1\}$
and $\{b_n; n \ge 1\}$ such that (2.1) holds for any such specific limit distribution.

The first problem is the (extremal) limit problem. It has been solved in Fisher
and Tippett (1928), Gnedenko (1943) and was later revived and streamlined by de
Haan (1970). Once we have derived the general form of all possible limit laws,
we need to solve the second part of the problem, which is called the domain of
attraction problem. This can be described more clearly in the following manner.
Assume that $G$ is a possible limit distribution for the sequence $a_n^{-1}(X_{n,n} - b_n)$.
What are the necessary and sufficient conditions on the distribution of $X$ to get
precisely that limiting distribution function $G$? General and specific examples can
easily illustrate the variety of distributions attracted to the different limits. The
set of such distributions will be called the domain of attraction of $G$ and is often
denoted by $\mathcal{D}(G)$.

Trying to avoid an overly mathematical treatment, we will not solve the above 
problem in its full generality. We rather provide a direct and partial approach to 
this problem, which works under the assumption that the underlying distribution 
possesses a continuous, strictly increasing distribution function F. 





In contrast with the central limit problem, the normal distribution does not
appear as a limiting distribution owing to the inherent skewness that is observed
in a distribution of maxima. In this section, we will show that all extreme value
distributions

$$G_\gamma(x) = \exp\left(-(1 + \gamma x)^{-1/\gamma}\right), \quad \text{for } 1 + \gamma x > 0,$$

with $\gamma \in \mathbb{R}$ can occur as limits in (2.1). The real quantity $\gamma$ is called the extreme
value index (EVI). It is a key quantity in the whole of extreme value analysis.
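
To make the limit in (2.1) concrete, the sketch below evaluates $G_\gamma$ and compares it with the exact distribution of normalized maxima in two textbook cases. Both cases and their normalizing constants are assumptions of this illustration rather than results quoted from the text: standard exponential data with $b_n = \log n$, $a_n = 1$ (limit $\gamma = 0$, read as $\exp(-e^{-x})$), and strict Pareto data with $1 - F(t) = t^{-1/\gamma}$, $t \ge 1$, with $b_n = n^\gamma$ and $a_n = \gamma n^\gamma$. All function names are hypothetical.

```python
import math

def gev_cdf(x, gamma):
    """Extreme value distribution G_gamma; gamma = 0 is read as exp(-exp(-x))."""
    if gamma == 0.0:
        return math.exp(-math.exp(-x))
    arg = 1.0 + gamma * x
    if arg <= 0.0:
        # Outside the support: below it when gamma > 0, above it when gamma < 0.
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-arg ** (-1.0 / gamma))

def max_cdf_exponential(x, n):
    """Exact P(X_{n,n} - log n <= x) = F^n(x + log n) for standard exponential data."""
    return max(0.0, 1.0 - math.exp(-x) / n) ** n

def max_cdf_pareto(x, n, gamma):
    """Exact P((X_{n,n} - n^gamma) / (gamma n^gamma) <= x) for 1 - F(t) = t^(-1/gamma)."""
    t = n ** gamma * (1.0 + gamma * x)
    if t <= 1.0:
        return 0.0
    return (1.0 - t ** (-1.0 / gamma)) ** n

for n in (10, 1000, 100000):
    print(n,
          max_cdf_exponential(1.0, n), gev_cdf(1.0, 0.0),
          max_cdf_pareto(1.0, n, 0.5), gev_cdf(1.0, 0.5))
```

As $n$ grows, the exact probabilities should approach the corresponding values of $G_\gamma$.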

In order to solve this general limit problem for extremes, we rely on a
classical concept from probability theory. Suppose that $\{Y_n\}$ is a sequence of random
variables. Then we say that $Y_n$ converges in distribution or converges weakly to $Y$,
if the distribution function of $Y_n$ converges pointwise to the distribution function
of $Y$, at least in all points where the latter is continuous. We write
$Y_n \xrightarrow{\mathcal{D}} Y$. In probability theory, one often proves weak convergence by looking at the
corresponding convergence of the characteristic functions. However, for our purposes,
we will rely on another well-known result from probability theory, that is, the
Helly-Bray theorem, see Billingsley (1995). This result transfers the convergence
in distribution to the convergence of expectations.

Theorem 2.1 Let $Y_n$ have distribution function $F_n$ and let $Y$ have distribution
function $F$. Then $Y_n \xrightarrow{\mathcal{D}} Y$ iff for all real, bounded and continuous functions $z$,

$$E(z(Y_n)) \to E(z(Y)).$$

For example, the sequence $\{Y_n\}$ satisfies a weak law of large numbers if the
random variable $Y$ is degenerate. Alternatively, by the Helly-Bray theorem,
$Y_n \xrightarrow{\mathcal{D}} Y$, degenerate in the constant $c$, iff $E(z(Y_n)) \to z(c)$. We will then
write $Y_n \xrightarrow{P} c$.

For the case of normalized maxima, the limit laws will depend on the crucial
parameter $\gamma$. For this reason, we include that parameter into the notation. So, put
$Y_n := a_n^{-1}(X_{n,n} - b_n)$ and $Y = Y_\gamma$. We then have that
$Y_n \xrightarrow{\mathcal{D}} Y_\gamma$ iff for all real, bounded and continuous functions $z$,

$$E\left\{z\left(a_n^{-1}(X_{n,n} - b_n)\right)\right\} \to \int_{-\infty}^{\infty} z(v)\, dG_\gamma(v)$$

as $n \to \infty$ where $G_\gamma(v) := P(Y_\gamma \le v)$.

The idea of the above equivalence is that convergence in distribution can be 
translated into the convergence of expectations for a sufficiently broad class of 
functions z. We go through the derivation of the extreme value laws as some of 
the intermediate steps are crucial to the whole theory of extremes. Therefore, let z 
be as above, a real, bounded and continuous function over the domain of F. First, 
note that 


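$$P(X_{n,n} \le x) = P\left(\bigcap_{i=1}^{n} (X_i \le x)\right) = \prod_{i=1}^{n} P(X_i \le x) = F^n(x).$$

Therefore, we find that

$$E\left\{z\left(a_n^{-1}(X_{n,n} - b_n)\right)\right\} = n \int_{-\infty}^{\infty} z\left(\frac{x - b_n}{a_n}\right) F^{n-1}(x)\, dF(x).$$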


We could restrict the domain of the distribution $F$ to the genuine interval of support.
This is determined by the left boundary $x_* := \sup\{x : F(x) = 0\}$ and the right
boundary $x^* := \inf\{x : F(x) = 1\}$. Unless this is important, we will not specify
these endpoints explicitly.

Recall that $F$ was supposed to be continuous. Therefore, we can set
$F(x) = 1 - v/n$. We solve this equation for $x$. The solution can be put in terms of the inverse
function $F^{\leftarrow}(y) = \inf\{x : F(x) \ge y\}$, which of course equals the quantile function
$Q(y)$, or the tail quantile function $U(y) = F^{\leftarrow}(1 - 1/y)$. Most often, we will use
the prescription

$$U(y) = Q\left(1 - \frac{1}{y}\right) = x \quad \text{and} \quad F(x) = 1 - \frac{1}{y}. \tag{2.2}$$


Note in particular that $x_* = U(1)$ and that $x^* = U(\infty)$ while $U$ is non-decreasing
over the interval $[1, \infty)$. The integral above therefore equals

$$\int_0^n z\left(\frac{U(n/v) - b_n}{a_n}\right) \left(1 - \frac{v}{n}\right)^{n-1} dv.$$

Now observe that $(1 - v/n)^{n-1} \to e^{-v}$ as $n \to \infty$ while the interval of integration
extends to the positive half-line. The only place where we still find elements of
the underlying distribution is in the argument of $z$. Therefore, we conclude that
a limit for $E\{z(a_n^{-1}(X_{n,n} - b_n))\}$ can be obtained when for some sequence $a_n$
we can make $a_n^{-1}(U(n/v) - b_n)$ convergent for all positive $v$. It seems natural
to think of $v = 1$, which suggests that $b_n = U(n)$ is an appropriate choice. The
natural condition to be imposed and crucial to all that follows is that for some
positive function $a$ and any $u > 0$,

$$\lim_{x \to \infty} \frac{U(xu) - U(x)}{a(x)} =: h(u) \text{ exists}, \tag{C}$$

with the limit function $h$ not identically equal to zero.
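
Condition (C) can be explored numerically for distributions whose tail quantile function is known in closed form. The three examples below (logistic, strict Pareto with index 0.5, uniform on (0, 1)), their auxiliary functions $a$ and the candidate limits $\log u$ and $(u^\gamma - 1)/\gamma$ are choices made for this sketch; the names `CASES`, `ratio` and `h` are hypothetical.

```python
import math

def h(u, gamma):
    """Candidate limit: (u^gamma - 1) / gamma, read as log u when gamma = 0."""
    return math.log(u) if gamma == 0.0 else (u ** gamma - 1.0) / gamma

# name: (tail quantile function U, auxiliary function a, gamma)
CASES = {
    "logistic": (lambda y: math.log(y - 1.0), lambda x: 1.0, 0.0),
    "pareto": (lambda y: y ** 0.5, lambda x: 0.5 * x ** 0.5, 0.5),
    "uniform": (lambda y: 1.0 - 1.0 / y, lambda x: 1.0 / x, -1.0),
}

def ratio(U, a, x, u):
    """Left-hand side of condition (C) before taking the limit in x."""
    return (U(x * u) - U(x)) / a(x)

for name, (U, a, gamma) in CASES.items():
    for x in (1e2, 1e4, 1e6):
        print(name, x, ratio(U, a, x, 2.0), h(2.0, gamma))
```

In the Pareto and uniform cases the ratio equals its limit for every $x$, whereas the logistic case shows genuine convergence as $x$ grows.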

Let us pause to prove the following basic limiting result. 

Proposition 2.2 The possible limits in (C) are given by

$$c\, h_\gamma(u) = c \int_1^u v^{\gamma - 1}\, dv = c\, \frac{u^\gamma - 1}{\gamma},$$

where $c \ge 0$, $\gamma$ is real and where we interpret $h_0(u) = \log u$.

The case where $c = 0$ is to be excluded since it leads to a degenerate limit
for $(X_{n,n} - b_n)/a_n$. Next, the case $c > 0$ can be reduced to the case $c = 1$ by
incorporating $c$ in the function $a$. Hence, we replace the condition (C) by the more
informative extremal domain of attraction condition

$$\lim_{x \to \infty} \frac{U(xu) - U(x)}{a(x)} = h_\gamma(u) \quad \text{for } u > 0, \tag{$C_\gamma$}$$

indicating that the possible limits are essentially described by the one-parameter
family $h_\gamma$. If necessary, we will even explicitly refer to the auxiliary function $a$
by referring to $C_\gamma(a)$. We will say that the underlying distribution $F$ satisfies the
extreme value condition $C_\gamma(a)$ if condition ($C_\gamma$) holds with the auxiliary function
$a$. On occasion, we will also assume functions other than the tail quantile function
$U$ to satisfy $C_\gamma(a)$.

Let us prove the above proposition. Let $u, v > 0$; then

$$\frac{U(xuv) - U(x)}{a(x)} = \frac{U(xuv) - U(xu)}{a(xu)}\,\frac{a(ux)}{a(x)} + \frac{U(xu) - U(x)}{a(x)}. \tag{2.3}$$

If we accept that the above convergence condition (C) is satisfied, then automatically the ratio $a(ux)/a(x)$ has to converge too. Let us call the limit $g(u)$. Then the mere existence of a limit can be translated into a condition on the function $g$. Indeed, for $u, v > 0$, we have

$$\frac{a(xuv)}{a(x)} = \frac{a(xuv)}{a(xv)}\,\frac{a(xv)}{a(x)}$$

and therefore, the function g satisfies the classical Cauchy functional equation 

g(uv) = g(u)g(v). 


The solution of this equation follows from 


Lemma 2.3 Any positive measurable solution of the equation $g(uv) = g(u)g(v)$, $u, v > 0$, is automatically of the form $g(u) = u^\gamma$ for a real $\gamma$.


If one writes $a(x) = x^\gamma \ell(x)$, then the limiting relation $a(xu)/a(x) \to u^\gamma$ leads to the condition $\ell(xu)/\ell(x) \to 1$. This kind of condition is basic within the theory of regular variation and will be discussed in more detail in a later section. In particular, any measurable function $\ell(x)$, positive for large enough $x$, that satisfies $\ell(xu)/\ell(x) \to 1$ as $x \to \infty$ for every $u > 0$ will be called a function of slow variation or a slowly varying function (s.v.). The function $a(x) = x^\gamma \ell(x)$ is then of regular variation or regularly varying, with index of regular variation $\gamma$.

We continue our derivation of Proposition 2.2. Take $g(u) = u^\gamma$. Then it seems natural to clarify the notation in (2.3) by assuming that the right-hand side in the limit relation is given by $h_\gamma$. Letting $x \to \infty$ in (2.3) shows that the latter function satisfies the functional equation

$$h_\gamma(uv) = h_\gamma(v)\,u^\gamma + h_\gamma(u). \tag{2.4}$$


When $\gamma = 0$, we immediately find that there exists a constant $c$ such that $h(u) = c \log u$. If $\gamma \ne 0$, then by symmetry we find that for $u, v > 0$

$$h_\gamma(uv) = h_\gamma(v)\,u^\gamma + h_\gamma(u) = h_\gamma(u)\,v^\gamma + h_\gamma(v).$$

Rearranging gives $h_\gamma(u)(v^\gamma - 1) = h_\gamma(v)(u^\gamma - 1)$, so the ratio $h_\gamma(u)/(u^\gamma - 1)$ does not depend on $u$. From this it follows that for some constant $d$, $h_\gamma(u) = d(u^\gamma - 1)$. We can incorporate the case $\gamma = 0$ if we replace the constant $d$ by the constant $c := \gamma d$. We therefore have derived that the right-hand side of an expression of the form

$$\lim_{x \to \infty} \frac{U(ux) - U(x)}{a(x)} = h(u)$$

is necessarily of the form $h(u) = c\,h_\gamma(u)$ for some constant $c$, where the auxiliary function $a$ is regularly varying with index $\gamma$.

For the case where $U$ is a tail quantile function, we can do even better, for $U$ is monotonically non-decreasing. So if $\gamma > 0$, then the constant $d$ is also non-negative since $h_\gamma(u)$ is non-decreasing, while if $\gamma < 0$, then also $d \le 0$. Therefore, in both cases $c = \gamma d \ge 0$ and so the quantity $c$ can be incorporated in the non-negative auxiliary function $a$. This solves the equation (2.4) and proves the expression in the statement of Proposition 2.2.
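As a numerical sanity check on Proposition 2.2, the following sketch evaluates the limit function and verifies the functional equation (2.4) on a small grid. The names `h` and `max_defect` and the grid values are illustrative choices, not part of the text.

```python
import math

def h(gamma, u):
    """h_gamma(u) = (u**gamma - 1)/gamma, with h_0(u) = log(u)."""
    if gamma == 0.0:
        return math.log(u)
    return (u ** gamma - 1.0) / gamma

def max_defect(gamma, grid):
    """Largest violation of h(uv) = h(v) * u**gamma + h(u) over the grid."""
    return max(
        abs(h(gamma, u * v) - (h(gamma, v) * u ** gamma + h(gamma, u)))
        for u in grid for v in grid
    )

grid = [0.5, 1.0, 2.0, 7.5]
# One entry per index gamma; each defect should vanish up to rounding error.
defects = {g: max_defect(g, grid) for g in (-0.5, 0.0, 0.25, 1.0)}
```

The defects are zero up to floating-point rounding for every real index tried, including the logarithmic case.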

Let us return to the derivation of the explicit form of the limit laws in Theorem 2.1. Under $C_\gamma(a)$, we find that with $b_n = U(n)$ and $a_n = a(n)$

$$E\{z(a_n^{-1}(X_{n,n} - b_n))\} \to \int_0^\infty z(h_\gamma(1/v))\,e^{-v}\,dv =: \int_{-\infty}^{\infty} z(u)\,dG_\gamma(u)$$

as $n \to \infty$. This in particular shows that (up to scale and location) the class of limiting distributions is a one-parameter family, indexed by $\gamma$. To get a more precise form, we rewrite the right-hand side of the above equation.

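It is easy to derive the three standard extremal types. Put $h_\gamma(1/v) = u$. Then

• if $\gamma > 0$

$$E\{z(a_n^{-1}(X_{n,n} - b_n))\} \to \int_{-1/\gamma}^{\infty} z(u)\,d\left(\exp(-(1 + \gamma u)^{-1/\gamma})\right),$$

• if $\gamma = 0$

$$E\{z(a_n^{-1}(X_{n,n} - b_n))\} \to \int_{-\infty}^{\infty} z(u)\,d\left(\exp(-e^{-u})\right),$$

• if $\gamma < 0$

$$E\{z(a_n^{-1}(X_{n,n} - b_n))\} \to \int_{-\infty}^{-1/\gamma} z(u)\,d\left(\exp(-(1 + \gamma u)^{-1/\gamma})\right).$$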


Note that the range of $G_\gamma$ depends on the sign of $\gamma$. For $\gamma > 0$, the carrier contains the positive half-line but has a negative left endpoint $-1/\gamma$. For $\gamma < 0$, the carrier contains the whole negative half-line and has a positive right endpoint $-1/\gamma$. Finally, for the Gumbel distribution $G_0$, that is, for $\gamma = 0$, the range is the whole real line. For convenience, we will sometimes write $S_\gamma$ for the range of the extremal law $G_\gamma$. Also, we will write $\eta_\gamma(u)$ for the solution of the equation $h_\gamma(1/v) = u$ in terms of $u$. Explicitly

$$\eta_\gamma(u) = (1 + \gamma u)^{-1/\gamma} \tag{2.5}$$

with the understanding that $\eta_0(u) = e^{-u}$.
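The limit law and its range can be written down directly from (2.5). The sketch below is one possible implementation; the function names `eta` and `gev_cdf` are hypothetical, and the handling of points outside the range (returning 0 or 1) is the usual convention for a distribution function rather than something stated in the text.

```python
import math

def eta(gamma, u):
    """eta_gamma(u) = (1 + gamma*u)**(-1/gamma); eta_0(u) = exp(-u)."""
    if gamma == 0.0:
        return math.exp(-u)
    return (1.0 + gamma * u) ** (-1.0 / gamma)

def gev_cdf(gamma, u):
    """G_gamma(u) = exp(-eta_gamma(u)) on its range, 0 or 1 outside it."""
    if gamma != 0.0 and 1.0 + gamma * u <= 0.0:
        # Left of the lower endpoint -1/gamma (gamma > 0) or
        # right of the upper endpoint -1/gamma (gamma < 0).
        return 0.0 if gamma > 0.0 else 1.0
    return math.exp(-eta(gamma, u))
```

Evaluating `gev_cdf` for an index close to zero gives values close to the Gumbel case, which reflects the continuity of the family in its index.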

The above analysis entails that under $(C_\gamma)$, the extreme value distributions can be obtained as limit distributions for the maximum of a simple random sample. It suffices to take $b_n = U(n)$ and $a_n = a(n)$ to have that $a_n^{-1}(X_{n,n} - b_n) \xrightarrow{\mathcal{D}} Y_\gamma$, where $Y_\gamma$ has distribution $G_\gamma$. Another way of stating this result is to write that $F \in \mathcal{D}(G_\gamma)$ if $F$ satisfies $C_\gamma(a)$. It can also be shown that these are the only possible limits. For more details on this, we refer the reader to the work of Gnedenko (1943) and de Haan (1970); see also Beirlant and Teugels (1995). Densities of some extreme value distributions are sketched in Figure 2.1.

The above result implies that, after a location and scale transformation $x \mapsto (x - b)/a$, the sample distribution of maxima $(Y - b)/a$ can be approximated by an extreme value distribution $G_\gamma$ if the size $n$ of the pure random samples from which these maxima are computed is sufficiently large.

Now we should turn to the domain of attraction problem: for what kind of distributions are the maxima attracted to a specific extreme value distribution, that is, which distributions satisfy the central condition $(C_\gamma)$? It will turn out that the sign of the EVI is the dominating factor in the description of the tail of the underlying distribution $F$. For that reason, we distinguish between the three cases where $\gamma > 0$, $\gamma < 0$ and the intermediate case where $\gamma = 0$. Because of their intrinsic importance, we treat these cases in separate sections. But before doing that, let us include a concrete example.

2.2 An Example 


We incorporate an example to illustrate the above procedure. 

The annual maximal discharges of the Meuse river in Belgium consist of maxima $Y_1, \ldots, Y_m$ where $m$ denotes the number of years available. Using a QQ-plot, we can attempt to fit the distribution of $Y = \max\{X_1, \ldots, X_{365}\}$ with an extreme value distribution. In this practical example, right-skewness is apparent from an exploratory data analysis (Figure 2.2).

The quantile function of an extreme value distribution is given by

$$Q_\gamma(p) = \frac{\left(\dfrac{1}{\log(1/p)}\right)^{\gamma} - 1}{\gamma}, \qquad p \in (0, 1).$$

The case $\gamma = 0$ corresponds to the Gumbel distribution with quantile function

$$Q_0(p) = \log\left(\frac{1}{\log(1/p)}\right), \qquad p \in (0, 1).$$

Figure 2.1 (a) Gumbel density (solid line) and two Frechet densities for parameters $\gamma = 0.28$ (broken line) and $\gamma = 0.56$ (broken-dotted line) and (b) Gumbel density (solid line) and two extreme value Weibull densities for $\gamma = -0.28$ (broken line) and $\gamma = -0.56$ (broken-dotted line).

Note that the extreme value quantiles are obtained from standard Frechet(1) quantiles $1/\log(1/p)$ (see Table 2.1) by using Box-Cox transformations $x \mapsto (x^\gamma - 1)/\gamma$. Except for the special case of a Gumbel quantile plot, a quantile plot for an extreme value distribution can only be obtained after specifying a value for $\gamma$. We start by considering the simple case of a Gumbel QQ-plot:

$$(-\log(-\log p_{i,n}),\; x_{i,n}), \qquad i = 1, \ldots, n.$$

The Gumbel quantile plot for the annual maximum discharges of the Meuse river is given in Figure 2.3(a). The plot indicates that a Gumbel distribution fits the data quite well and supports the common use in hydrology of this simplified model for annual river discharge maxima. Performing a least squares regression on the graph given in Figure 2.3(a), we obtain an intercept value $b = 1247$ and a slope $a = 446$, which yield estimates of the location and scale parameters of the fitted Gumbel model.
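The fitting step just described can be sketched as follows. This is only an illustration under stated assumptions: the Meuse data are not reproduced here, so the sample is simulated from a Gumbel distribution with location 1247 and scale 446; the plotting positions $p_{i,n} = i/(n+1)$ and the names `q_gev` and `gumbel_qq_fit` are choices made for the example.

```python
import math
import random

def q_gev(gamma, p):
    """Quantile Q_gamma(p) of the standard extreme value distribution."""
    w = 1.0 / math.log(1.0 / p)          # standard Frechet(1) quantile
    return math.log(w) if gamma == 0.0 else (w ** gamma - 1.0) / gamma

def gumbel_qq_fit(data):
    """Least squares line through (Q_0(p_i), x_(i)); returns (intercept, slope)."""
    xs = sorted(data)
    n = len(xs)
    qs = [q_gev(0.0, i / (n + 1.0)) for i in range(1, n + 1)]  # assumed p_i = i/(n+1)
    q_bar, x_bar = sum(qs) / n, sum(xs) / n
    slope = (sum((q - q_bar) * (x - x_bar) for q, x in zip(qs, xs))
             / sum((q - q_bar) ** 2 for q in qs))
    return x_bar - slope * q_bar, slope

random.seed(1)
# Simulated Gumbel sample (inverse transform); this is NOT the Meuse data.
sample = [1247.0 + 446.0 * q_gev(0.0, max(random.random(), 1e-300))
          for _ in range(2000)]
intercept, slope = gumbel_qq_fit(sample)
```

The intercept and slope of the fitted line estimate the location and scale parameters; on simulated data they land near the values used to generate the sample.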

When constructing an extreme value QQ-plot, we look for the value of $\gamma$ in the neighbourhood of 0 that maximizes the correlation coefficient on the QQ-plot. This is obtained for a value $\gamma = -0.034$. The corresponding QQ-plot is given in Figure 2.3(b). Here, the least squares line is fitted with $b = 1252$ and $a = 462$. Of course, $\gamma$, $a$ and $b$ can also be estimated by other methods. Here, we can mention the maximum likelihood method and the method of probability-weighted moments. In fact, these estimation methods are quite popular in this context. They will be discussed in Chapter 5.

Figure 2.2 (a) Time plot, (b) boxplot, (c) histogram and (d) normal quantile plot for the annual maximal discharges of the Meuse.

Figure 2.3 Annual maximal river discharges of the Meuse: (a) Gumbel quantile plot and (b) extreme value quantile plot.

These first estimates of $\gamma$, $b$ and $a$ can now be used to estimate the 100-year return level for this particular problem. Indeed,

$$U(100) = Q\left(1 - \frac{1}{100}\right) = 1251.86 + 461.55\,\frac{\left(-\log\left(1 - \frac{1}{100}\right)\right)^{0.034} - 1}{-0.034} = 3217.348.$$
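The return level computation can be reproduced in a few lines. The sketch below plugs the estimates quoted in the text into the extreme value quantile function; the function name `return_level` and its argument names are illustrative.

```python
import math

def return_level(period, gamma, loc, scale):
    """loc + scale * Q_gamma(1 - 1/period) for the extreme value quantile Q_gamma."""
    p = 1.0 - 1.0 / period
    y = -math.log(p)                     # log(1/p)
    if gamma == 0.0:
        q = -math.log(y)                 # Gumbel quantile
    else:
        q = (y ** (-gamma) - 1.0) / gamma
    return loc + scale * q

# Estimates from the extreme value QQ-plot: gamma = -0.034, b = 1251.86, a = 461.55.
level_gev = return_level(100, -0.034, 1251.86, 461.55)
# Same computation under the fitted Gumbel model (gamma = 0, b = 1247, a = 446).
level_gumbel = return_level(100, 0.0, 1247.0, 446.0)
```

With the extreme value estimates, `level_gev` agrees with the value 3217.348 quoted above to the printed precision. The Gumbel version is included only for comparison between the two fitted models.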
2.3 The Frechet-Pareto Case: $\gamma > 0$

We begin with a simple example that can easily be generalized. Then we look for the connection between the tail quantile function $U$ and the underlying distribution $F$. After rewriting the limiting result in its historical form, we give some sufficient conditions. We finish with a number of examples.


2.3.1 The domain of attraction condition 


As the prime example, we mention the strict Pareto distribution $Pa(\alpha)$ with survival function $\bar F(x) = 1 - F(x) = x^{-\alpha}$, $x > 1$. Here, $\alpha$ is a positive number, called the Pareto index. For this distribution, $Q(p) = (1 - p)^{-1/\alpha}$ and hence $U(x) = x^\gamma$ with $\gamma = 1/\alpha$. Then

$$\frac{U(xu) - U(x)}{a(x)} = \frac{(xu)^\gamma - x^\gamma}{a(x)} = \frac{x^\gamma}{a(x)}\,(u^\gamma - 1)$$

so that the auxiliary function $a(x) = \gamma x^\gamma$ leads to

$$\frac{U(xu) - U(x)}{a(x)} = \frac{u^\gamma - 1}{\gamma} = h_\gamma(u),$$

and hence $C_\gamma(a)$ is clearly satisfied, actually with equality.
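The equality can be confirmed numerically. The following sketch evaluates the normalized increment of the tail quantile function for a strict Pareto distribution and compares it with the limit function; the Pareto index 2 and the grid points are arbitrary choices for illustration.

```python
import math

def h(gamma, u):
    """Limit function h_gamma(u), with the logarithm as the gamma = 0 case."""
    return math.log(u) if gamma == 0.0 else (u ** gamma - 1.0) / gamma

def pareto_ratio(alpha, x, u):
    """{U(xu) - U(x)}/a(x) for Pa(alpha): U(x) = x**gamma, a(x) = gamma * x**gamma."""
    gamma = 1.0 / alpha
    tail_quantile = lambda t: t ** gamma
    return (tail_quantile(x * u) - tail_quantile(x)) / (gamma * x ** gamma)

alpha = 2.0
# Differences from h_gamma(u); they vanish for every x, not only in the limit.
gaps = [abs(pareto_ratio(alpha, x, u) - h(1.0 / alpha, u))
        for x in (1.0, 10.0, 1e4) for u in (0.5, 2.0, 30.0)]
```

All entries of `gaps` are zero up to rounding, whatever the value of $x$, which is what "with equality" means here.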

However, there is a much broader class of distributions that satisfies $C_\gamma(a)$ with $\gamma > 0$. Indeed, we can take $U(x) = x^\gamma \ell_U(x)$ where $\ell_U$ is a slowly varying function. Then $(C_\gamma)$ is also satisfied. Indeed, for $x \uparrow \infty$,

$$\frac{U(xu) - U(x)}{a(x)} = \frac{(xu)^\gamma \ell_U(xu) - x^\gamma \ell_U(x)}{a(x)} = \frac{\ell_U(x)\,x^\gamma}{a(x)}\left(u^\gamma\,\frac{\ell_U(xu)}{\ell_U(x)} - 1\right) \sim \frac{u^\gamma - 1}{\gamma}$$

when choosing $a(x) = \gamma x^\gamma \ell_U(x) = \gamma U(x)$, or even more flexibly,

$$\lim_{x \uparrow \infty} \frac{a(x)}{U(x)} = \gamma. \tag{2.6}$$



Distributions for which $U(x) = x^\gamma \ell_U(x)$ are called Pareto-type distributions. Remark that for these distributions, $U$ is regularly varying with index $\gamma$ since

$$\lim_{x \to \infty} \frac{U(xt)}{U(x)} = \lim_{x \to \infty} \frac{(xt)^\gamma \ell_U(xt)}{x^\gamma \ell_U(x)} = t^\gamma \quad \text{for all } t > 0.$$

Also note that once the domain of attraction condition $C_\gamma(a)$ is satisfied, we can choose the normalizing constants by the expressions

$$b_n = U(n) = n^\gamma \ell_U(n) \quad \text{and} \quad a_n = a(n).$$

Figure 2.4 Plot of $P\{(X_{n,n} - U(n))/(\gamma U(n)) \le x\} = \left(1 - (n(1 + x))^{-1}\right)^n$ for $n = 2$ (dotted line), $n = 5$ (broken-dotted line), $n = 10$ (broken line) and its limit, for $n \to \infty$, $\exp(-(1 + x)^{-1})$ (solid line).

The convergence of $(X_{n,n} - U(n))/(\gamma U(n))$ to its extreme value limit is illustrated in Figure 2.4 for the case of the strict Pareto distribution with $\gamma = 1$. Here, the convergence appears to be quite fast. In Chapter 3, this will be shown not to be the case in general.
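The speed of convergence shown in Figure 2.4 can be quantified with a short computation. The sketch below evaluates the exact distribution of the normalized maximum for the strict Pareto distribution with index 1 and records its largest distance to the limit on a grid; the grid and the extra sample size 100 are illustrative additions.

```python
import math

def cdf_normalized_max(n, x):
    """P{(X_{n,n} - U(n)) / (gamma U(n)) <= x} for Pa(1), where gamma = 1 and U(n) = n."""
    threshold = n * (1.0 + x)            # the event is X_{n,n} <= n + n*x
    if threshold <= 1.0:                 # below the support [1, inf) of Pa(1)
        return 0.0
    return (1.0 - 1.0 / threshold) ** n

def limit_cdf(x):
    """G_1(x) = exp(-(1 + x)**(-1)) for x > -1, and 0 otherwise."""
    return math.exp(-1.0 / (1.0 + x)) if x > -1.0 else 0.0

grid = [-0.9 + 0.1 * k for k in range(60)]
# Largest absolute gap to the limit law over the grid, per sample size n.
sup_gap = {n: max(abs(cdf_normalized_max(n, x) - limit_cdf(x)) for x in grid)
           for n in (2, 5, 10, 100)}
```

The gap shrinks steadily as $n$ grows, in line with the visual impression from the figure.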


2.3.2 Condition on the underlying distribution 

We will show later that the condition $(C_\gamma)$ for $\gamma > 0$ is equivalent to the condition that for $w > 0$,

$$\frac{1 - F(xw)}{1 - F(x)} \to w^{-1/\gamma} \quad \text{as } x \to \infty.$$

This is precisely saying that $x^{1/\gamma}(1 - F(x))$ is s.v. Hence, there exists an s.v. function $\ell_F(x)$ such that $1 - F(x) = x^{-\alpha}\ell_F(x)$, where $\alpha = 1/\gamma$. It will then follow that the definition of a Pareto-type distribution can be formulated in terms of the distribution $F$ as well as in terms of the tail quantile function. The term Pareto-type refers to the tail of the distribution and, roughly speaking, it means that as $x \to \infty$, the survival function $1 - F(x)$ tends to zero at a polynomial speed, that is, as $x^{-\alpha}$ for some unknown index $\alpha$.

The link between the two s.v. functions depends on the concept of the de Bruyn conjugate introduced in Proposition 2.5 in section 2.9.3. In its neatest form, we can state that there is full equivalence between the statements

$$1 - F(x) = x^{-1/\gamma}\,\ell_F(x) \quad \text{and} \quad U(x) = x^{\gamma}\,\ell_U(x) \tag{2.7}$$

where the two slowly varying functions $\ell_F$ and $\ell_U$ are linked together via the de Bruyn conjugation. Remark that with increasing value of $\gamma$ the tail becomes heavier, that is, the dispersion is larger; otherwise stated, large outliers become even more likely. When $\gamma > 1$, the expected value $E(X)$ of $X$ does not exist, as can be readily checked for the strict Pareto distribution. For $\gamma > 0.5$, even the variance is infinite. For this reason, Pareto-type distributions are often invoked to model data with extremely heavy tails. More specifically, denoting the positive part of $X$ by $X_+$, one finds that

$$E(X_+^c) \begin{cases} = \infty, & c\gamma > 1, \\ < \infty, & c\gamma < 1. \end{cases}$$


2.3.3 The historical approach 

We remark that our approach follows a different path than in the classical theory. From the general approach, we know that $\{X_{n,n} - U(n)\}/a(n) \xrightarrow{\mathcal{D}} Y_\gamma$. However, as derived in Theorem 2.3, we also know that $a(n)/U(n) \to \gamma$. If we write

$$\frac{X_{n,n}}{U(n)} = \frac{a(n)}{U(n)}\left\{\frac{X_{n,n} - U(n)}{a(n)} + \frac{U(n)}{a(n)}\right\},$$

then $X_{n,n}/U(n) \xrightarrow{\mathcal{D}} Z_\gamma$ where

$$P\{Z_\gamma \le z\} = P\left\{\gamma\left(Y_\gamma + \frac{1}{\gamma}\right) \le z\right\} = G_\gamma\left(\frac{z - 1}{\gamma}\right) = \exp(-z^{-1/\gamma}). \tag{2.8}$$

In the early literature on the subject, the latter distribution is often abbreviated by $\Phi_{1/\gamma}(z)$ and called the extreme value distribution of type II. The limit law in terms of $Z_\gamma$ looks simpler than the one for $Y_\gamma$. We write $F \in \mathcal{D}(G_\gamma) = \mathcal{D}(\Phi_{1/\gamma})$. From the statistical point of view, it is advisable to work with the condition in terms of the quantile function $U$, as the investigator usually does not know in advance that $\gamma > 0$.

As the first example of (2.8) appeared in Frechet (1927), the class $\mathcal{D}(\Phi_{1/\gamma})$ should be attributed to him. On the other hand, an equivalent description is given by the Pareto-type distributions. We therefore opted for the Frechet-Pareto-class terminology.


2.3.4 Examples 

Examples of distributions of Frechet-Pareto type are given in Table 2.1. We note that the condition in (2.7) can easily be used to derive explicit examples for the Frechet case.

A well-known sufficient condition can be given in terms of the hazard function

$$r(x) = \frac{f(x)}{1 - F(x)},$$

where it is assumed that $F$ has a derivative $f$.


Proposition 2.1 Von Mises’ theorem If $x_* = \infty$ and $\lim_{x \uparrow \infty} x\,r(x) = \alpha > 0$, then $F \in \mathcal{D}(\Phi_\alpha)$.


If we use Theorem 2.3 (ii) below and define $\varepsilon(x) := x\,r(x) - \alpha$, then it easily follows by integration that

$$1 - F(x) = \{1 - F(1)\}\exp\left\{-\int_1^x \frac{\alpha + \varepsilon(u)}{u}\,du\right\} =: C x^{-\alpha}\,\ell(x) \quad \text{for } x \ge 1.$$

The function $\ell(x)$ defined in this way can easily be shown to be slowly varying.

The von Mises result is readily applied when, for some $\beta > 1$, the tail of the density $f$ is of the form $f(x) \sim x^{-\beta}\ell(x)$ for some s.v. $\ell(x)$. It then follows that also the tail of $F$ is regularly varying. Actually,

$$1 - F(x) = \int_x^\infty f(y)\,dy = x f(x)\int_1^\infty \frac{f(xv)}{f(x)}\,dv \sim (\beta - 1)^{-1}\,x f(x)$$

and thus $x\,r(x) \to \beta - 1$, so that the von Mises sufficient condition is satisfied with $\alpha = \beta - 1$.
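A quick numerical illustration of the von Mises condition is given below for a Burr distribution. The parametrization $1 - F(x) = (\eta/(\eta + x^\tau))^\lambda$ is an assumption made for this sketch, as are the function names; with it, $x\,r(x) = \lambda\tau x^\tau/(\eta + x^\tau)$, which tends to $\lambda\tau$.

```python
def burr_survival(x, eta, tau, lam):
    """1 - F(x) = (eta / (eta + x**tau))**lam (assumed Burr parametrization)."""
    return (eta / (eta + x ** tau)) ** lam

def burr_density(x, eta, tau, lam):
    """Derivative of F: lam * tau * x**(tau-1) * eta**lam / (eta + x**tau)**(lam+1)."""
    return lam * tau * x ** (tau - 1.0) * eta ** lam / (eta + x ** tau) ** (lam + 1.0)

def x_times_hazard(x, eta, tau, lam):
    """x * r(x) with hazard r = f / (1 - F)."""
    return x * burr_density(x, eta, tau, lam) / burr_survival(x, eta, tau, lam)

eta, tau, lam = 1.0, 2.0, 2.0
values = [x_times_hazard(x, eta, tau, lam) for x in (1.0, 10.0, 100.0, 1000.0)]
alpha = lam * tau        # limit of x r(x); the EVI is then 1/alpha = 0.25
```

The entries of `values` increase towards `alpha`, so under this parametrization the Burr(1, 2, 2) distribution has extreme value index 0.25, matching the value quoted for it in the caption of Figure 2.5.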

Examples of Pareto-type distributions are the Burr distribution, the Generalized 
Pareto distribution, the log-gamma distribution and the Frechet distribution (see 
Table 2.1). 

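Note that all t-distributions are in the Frechet-Pareto domain. A t-density $f$ with $k$ degrees of freedom has the form

$$f(x) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\left(\frac{k}{2}\right)}\left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}}, \qquad x \in \mathbb{R}.$$

Then

$$1 - F(x) \sim \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{k}{2}\right)}\,k^{\frac{k}{2} - 1}\,x^{-k}, \qquad x \to \infty.$$

Hence, $F \in \mathcal{D}(\Phi_k)$. In particular, the Cauchy distribution is an element of $\mathcal{D}(\Phi_1)$.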

But also all F-distributions are in the Frechet-Pareto domain. An F-density with $m$ and $n$ degrees of freedom is given by the expression

$$f(x) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\left(\frac{m}{n}\right)^{m/2} x^{\frac{m}{2} - 1}\left(1 + \frac{m}{n}\,x\right)^{-\frac{m+n}{2}}, \qquad x > 0.$$

As before, $1 - F(x) \sim C_{m,n}\,x^{-n/2}$ as $x \to \infty$ with $C_{m,n}$ a constant, and hence $F \in \mathcal{D}(\Phi_{n/2})$.

Most of the above examples have a slowly varying function $\ell_U$ of the type

$$\ell_U(x) = C\left(1 + D x^{-\beta}(1 + o(1))\right), \qquad x \to \infty,$$

for some constants $C > 0$, $D \in \mathbb{R}$, $\beta > 0$. The corresponding subclass of the Pareto-type distributions is often named the Hall class of distributions, referring to Hall (1982). This class plays an important role in the discussion of estimators of a positive EVI $\gamma$.

2.3.5 Fitting data from a Pareto-type distribution 

Here, we recall and extend some of the basic notions on QQ-plots discussed earlier in section 1.2.1.


Case 1: The strict Pareto case 

The strict Pareto distribution $Pa(\alpha)$ with survival function $\bar F(x) = x^{-\alpha}$ ($x > 1$) has quantile function $Q(p) = (1 - p)^{-1/\alpha}$. Hence, $\log Q(p) = -\frac{1}{\alpha}\log(1 - p)$. A Pareto QQ-plot is obtained from an exponential QQ-plot after taking the logarithm of the data:

$$(-\log(1 - p_{i,n}),\; \log x_{i,n}), \qquad i = 1, \ldots, n.$$

Indeed, taking a log-transformation of a strict Pareto random variable results in an exponential random variable. We expect to see a linear shape with intercept zero and slope approximately equal to $1/\alpha$ in case a strict Pareto distribution fits well.
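The construction can be tried out on simulated data. The sketch below draws a strict Pareto sample by inverse transform, forms the Pareto QQ-plot coordinates and fits a least squares line; the plotting positions $p_{i,n} = i/(n+1)$, the Pareto index 2, the seed and the function name are all assumptions of the example.

```python
import math
import random

def pareto_qq_fit(data):
    """Least squares (intercept, slope) of log x_(i) on -log(1 - p_i), p_i = i/(n+1)."""
    xs = sorted(data)
    n = len(xs)
    e = [-math.log(1.0 - i / (n + 1.0)) for i in range(1, n + 1)]
    y = [math.log(x) for x in xs]
    e_bar, y_bar = sum(e) / n, sum(y) / n
    slope = (sum((a - e_bar) * (b - y_bar) for a, b in zip(e, y))
             / sum((a - e_bar) ** 2 for a in e))
    return y_bar - slope * e_bar, slope

random.seed(7)
alpha = 2.0
# Inverse transform with Q(p) = (1 - p)**(-1/alpha); 1 - random.random() lies in (0, 1].
sample = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(5000)]
intercept, slope = pareto_qq_fit(sample)
```

For a sample of this size the fitted slope should be close to $1/\alpha = 0.5$ and the intercept close to zero, up to sampling variability.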


Case 2: The bounded Pareto case 

We put a left bound on the Pareto distribution. To do that, we consider the conditional distribution of $X$ given $X > t$, where the data are censored by a lower retention level $t$. The conditional distribution is obtained from

$$\bar F_t(x) = P(X > x \mid X > t) = \left(\frac{x}{t}\right)^{-\alpha}, \qquad x > t,$$

and its quantile function is given by

$$Q_t(p) = t\,(1 - p)^{-1/\alpha}, \qquad p \in (0, 1).$$

Hence, following the same procedure as described for the strict Pareto case, we plot

$$(-\log(1 - p_{i,n}),\; \log x_{i,n}), \qquad i = 1, \ldots, n,$$

and we obtain a straight line pattern with intercept $\log t$ and a slope again approximately equal to $1/\alpha$.

Figure 2.5 Pareto QQ-plot for simulated data of size $n = 1500$ from (a) the Burr(1,2,2) distribution ($\gamma = 0.25$) and (b) the log $\Gamma(1, 2)$ distribution ($\gamma = 1$). Horizontal axis in both panels: standard exponential quantiles.


Case 3: Pareto-type distributions 

Pareto-type distributions were introduced in 2.3.1. The effect obtained by applying a Pareto QQ-plot to a Pareto-type distribution is illustrated in Figure 2.5.

In all the examples in 2.3.4, the leading factor in the expression of the survival function is of the form $x^{-1/\gamma}$ for some positive number $\gamma$. The value of $\gamma$ is also given in Table 2.1 in terms of the model parameters. In Chapter 4, we will present an estimation procedure for $\alpha$ based on the Pareto QQ-plot.

As we have shown, Pareto-type distributions were found to be the precise set of distributions for which the sample maxima are attracted to an extreme value distribution with EVI $\gamma = 1/\alpha > 0$. However, it now appears that this set of distributions can be used in a broader statistical setting than the one considered in section 3.1, where the statistical practitioner had only access to data from independent block maxima. Indeed, Pareto-type distributions are characterized by the specification

$$U(x) = x^\gamma\,\ell_U(x)$$

for the tail quantile function $U$. Here, $\ell_U$ denotes an s.v. function as indicated in (2.7). So,

$$Q(1 - p) = p^{-\gamma}\,\ell_U\left(\frac{1}{p}\right), \qquad p \in (0, 1).$$

Then

$$\log Q(1 - p) = -\gamma \log p + \log \ell_U\left(\frac{1}{p}\right)$$

and, since for every s.v. function $\ell_U$

$$\frac{\log \ell_U(1/p)}{\log p} \to 0 \quad \text{as } p \to 0$$

(see (2.19) below), we have that

$$\frac{\log Q(1 - p)}{-\log p} \to \gamma \quad \text{as } p \to 0,$$

which explains the ultimate linear appearance in a Pareto QQ-plot in case of an underlying Pareto-type distribution. This is illustrated in Figure 2.5.

In Figure 2.5, the Pareto QQ-plot has been constructed for simulated data sets of size $n = 1500$ (a) from the Burr distribution and (b) from the log-gamma distribution. It is clear that the straight line pattern only appears at the right end of the plot. This has been suggested by superimposing a line segment on the right tail of the QQ-plot. The fitting of this line segment has been performed manually. In both cases, the slope of the fitted line is set equal to the reference value $1/\alpha$.

Pareto-type behaviour can also be deduced from probability-probability plots or PP-plots. For the strict Pareto distribution, the coordinates of the points on the PP-plot are given by

$$\left(x_{i,n}^{-1/\gamma},\; 1 - p_{i,n}\right), \qquad i = 1, \ldots, n.$$

In case the Pareto distribution fits well, we expect the points to be close to the first diagonal. This ‘classical’ PP-plot requires knowledge of $\gamma$, or at least of an estimate of $\gamma$. Alternatively, log-transforming the above coordinates and changing signs leads to the plot

$$(\log x_{i,n},\; -\log(1 - p_{i,n})), \qquad i = 1, \ldots, n, \tag{2.9}$$

which is obtained from the Pareto QQ-plot by interchanging the coordinates. For the strict Pareto distribution, $-\log(1 - F(x)) = \frac{1}{\gamma}\log x$, so the Pareto probability plot defined by (2.9) will be approximately linear with slope $1/\gamma$. Using arguments similar to the ones in the discussion of the Pareto quantile plot, it is easy to show that for Pareto-type distributions, the plot (2.9) will be ultimately linear with slope $1/\gamma$. This is illustrated in Figure 2.6.

Figure 2.6 Pareto PP-plot for simulated data of size $n = 1500$ from (a) the Pa(1) distribution, (b) the Burr(1,0.5,2) distribution and (c) the log $\Gamma(1, 1.5)$ distribution. Horizontal axis in all panels: $\log(x)$.

2.4 The (Extremal) Weibull Case: $\gamma < 0$

We follow the same pattern as in the previous case. It will turn out that there is a full equivalence between the cases $\gamma > 0$ and $\gamma < 0$.

2.4.1 The domain of attraction condition 

Let us again start with a simple example, slightly more complicated than the uniform distribution. Let $0 < x_* < \infty$ and look at the survival function on $(0, x_*)$:

$$1 - F(x) = (1 - x/x_*)^{\beta},$$

where $\beta > 0$. It follows that $U(x) = x_*(1 - x^{-1/\beta})$ on $[1, \infty)$. Then

$$\frac{U(xu) - U(x)}{a(x)} = \frac{x_*\left\{(1 - (xu)^{-1/\beta}) - (1 - x^{-1/\beta})\right\}}{a(x)} = \frac{x_*\,x^{-1/\beta}}{\beta\,a(x)}\,h_{-1/\beta}(u).$$

Therefore, we recover the condition $C_\gamma(a)$ if we make the choice $\gamma = -1/\beta < 0$ and $a(x) = (1/\beta)\,x_*\,x^{-1/\beta}$ for the auxiliary function. Note that $a \in \mathcal{R}_\gamma$, as it should be, and that $a(x) = (-\gamma)(x_* - U(x))$.
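The same kind of numerical check as in the Pareto case applies here. In the sketch below the exponent of the survival function is written `beta` and the right endpoint `x_star`; both names and the numerical values are choices made for the illustration.

```python
def tail_quantile(x, x_star, beta):
    """U(x) = x_star * (1 - x**(-1/beta)) for 1 - F(x) = (1 - x/x_star)**beta."""
    return x_star * (1.0 - x ** (-1.0 / beta))

def h(gamma, u):
    """Limit function h_gamma(u) for a non-zero index gamma."""
    return (u ** gamma - 1.0) / gamma

x_star, beta = 3.0, 2.0
gamma = -1.0 / beta                       # extreme value index of this example

def a(x):
    """Auxiliary function a(x) = (-gamma) * (x_star - U(x))."""
    return -gamma * (x_star - tail_quantile(x, x_star, beta))

gaps = [abs((tail_quantile(x * u, x_star, beta) - tail_quantile(x, x_star, beta)) / a(x)
            - h(gamma, u))
        for x in (2.0, 50.0, 1e3) for u in (0.5, 2.0, 10.0)]
```

As in the strict Pareto case, the gaps are zero up to rounding for every $x$, so the domain of attraction condition again holds with equality.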

Again, there is a much broader class of distributions where the above reasoning holds. Let $x_* < \infty$, put $1 - F(x) = (x_* - x)^{-1/\gamma}\,\ell_F(1/(x_* - x))$ as $x \uparrow x_*$, and put $\ell(v) = \ell_F^{\gamma}(v)$. It then follows by Proposition 2.5 that $U(y) = x_* - y^\gamma\,\ell_U(y)$ as $y \uparrow \infty$, where $\ell_U(y) = 1/\ell^*(y^{-\gamma})$ with $\ell^*$ denoting the de Bruyn conjugate of $\ell$ (defined in section 2.9.3 below). Then

$$\frac{U(xu) - U(x)}{a(x)} = \frac{x^\gamma\,\ell_U(x)}{a(x)}\left(1 - u^\gamma\,\frac{\ell_U(xu)}{\ell_U(x)}\right) \sim -\gamma\,\frac{x^\gamma\,\ell_U(x)}{a(x)}\,h_\gamma(u),$$

which is of the required $C_\gamma(a)$ form if we choose

$$\frac{a(x)}{x_* - U(x)} \to -\gamma. \tag{2.10}$$


Figure 2.7 shows the convergence of $(X_{n,n} - U(n))/a(n)$, with $a(n) = -\gamma(x_* - U(n))$, to its extreme value limit in case of the $U(0, 1)$ distribution.

Figure 2.7 Plot of $P\{(X_{n,n} - U(n))/a(n) \le x\} = \left(1 + \frac{x - 1}{n}\right)^n$ for $n = 2$ (dotted line), $n = 5$ (broken-dotted line), $n = 10$ (broken line) and its limit, for $n \to \infty$, $\exp(-(1 - x))$ (solid line).

2.4.2 Condition on the underlying distribution

Note that again all the previous steps can be reversed so that there is full equivalence between the statements

$$1 - F\left(x_* - \frac{1}{x}\right) = x^{1/\gamma}\,\ell_F(x), \quad x \uparrow \infty, \qquad \text{and} \qquad U(x) = x_* - x^\gamma\,\ell_U(x), \quad x \uparrow \infty. \tag{2.11}$$

The proof is similar to the one given for $\gamma > 0$ in section 2.9.3.

2.4.3 The historical approach 

We give an indication of the earlier derivation for the case where $\gamma < 0$. We continue to take $x_* < \infty$ and $(x_* - U(x))/a(x) \to -\gamma^{-1}$. So, write

$$\frac{X_{n,n} - x_*}{x_* - U(n)} = \frac{a(n)}{x_* - U(n)}\left\{\frac{X_{n,n} - U(n)}{a(n)} + \frac{U(n) - x_*}{a(n)}\right\}.$$

If we are in the domain of attraction $\mathcal{D}(G_\gamma)$ for $\gamma < 0$, then also the left-hand side of the above equation converges in distribution, say to $Z_\gamma$, and we have

$$Z_\gamma = (-\gamma)\left(Y_\gamma + \frac{1}{\gamma}\right)$$

with distribution

$$P(Z_\gamma \le z) = \exp\left(-|z|^{-1/\gamma}\right)$$

for $z < 0$.

Again, the latter limit relation has been treated differently in the older literature. One used to write, for $z < 0$, $\Psi_\alpha(z) = \exp(-|z|^\alpha)$ for the extreme value distribution of type III. We then see that $F \in \mathcal{D}(G_\gamma) = \mathcal{D}(\Psi_{-1/\gamma})$. It is true that the above limit distribution is again a bit simpler than the one using $Y_\gamma$. However, usually the statistician has no prior information on the sign of the EVI $\gamma$. Moreover, splitting the one-parameter set of limit distributions into three apparently different subcases spoils the unity of extreme value theory.

2.4.4 Examples 

Put $Y := (x_* - X)^{-1}$. As mentioned before, the extreme value Weibull case and the Frechet case are easily linked through the identification:

$$F_X \in \mathcal{D}(\Psi_\alpha) \iff F_Y \in \mathcal{D}(\Phi_\alpha).$$



Table 2.2 A list of distributions in the extreme value Weibull domain. [Table body not recoverable from the source. Column headings: Distribution, Extreme value index. Legible rows include the uniform distribution and the extreme value Weibull distribution.]

This equivalence follows from simple algebra in that

$$1 - F_X\left(x_* - \frac{1}{x}\right) = P\left(X > x_* - \frac{1}{x}\right) = P\left(\frac{1}{x_* - X} > x\right) = 1 - F_Y(x).$$

Some examples of distributions in the extreme value Weibull domain are given in Table 2.2.

Apart from the determination of the right endpoint $x_*$, the cases $\gamma < 0$ and $\gamma > 0$ are fully equivalent. Note in particular that in case a density exists, $f_Y(x) = x^{-2} f_X(x_* - x^{-1})$. Therefore, there exists a sufficient von Mises condition in terms of the hazard function $r = f/(1 - F)$.

Proposition 2.1 Von Mises’ theorem If $x_* < \infty$ and $\lim_{x \uparrow x_*} (x_* - x)\,r(x) = \alpha > 0$, then $F \in \mathcal{D}(\Psi_\alpha)$.

The proof is similar to the one for the Frechet-Pareto case. From the condition, it follows that

$$\frac{f\left(x_* - \frac{1}{t}\right)}{1 - F\left(x_* - \frac{1}{t}\right)} \sim \alpha t \quad \text{as } t \to \infty,$$

which leads to $f_X(x_* - t) \sim t^{\alpha - 1}\,\ell(1/t)$ when $t \downarrow 0$ and where $\ell$ is slowly varying.

Explicit examples are known as well. Beta-distributions are among the most popular elements from the extreme value Weibull domain. Recall the beta-density with parameters $p$ and $q$

$$f(x) = \frac{1}{B(p, q)}\,x^{p - 1}(1 - x)^{q - 1}, \qquad 0 < x < 1.$$

Here, $x_* = 1$ and $1 - F(1 - x) \sim \{qB(p, q)\}^{-1} x^q$ as $x \downarrow 0$, making the distribution an element from $\mathcal{D}(\Psi_q)$. In particular, the uniform distribution is an element of $\mathcal{D}(\Psi_1)$.

A graphical description of tails of models with right-bounded support as a function of the value of $\gamma$ is given in Figure 2.8. Remark especially the different shapes near the endpoint $x_*$ around the values $\gamma = -1/2$ and $-1$.


2.5 The Gumbel Case: $\gamma = 0$

This case, often called extremal type I, is more diverse than the two previous ones: this set of tails turns out to be quite complex, as can be seen from Table 2.3, which contains a list of distributions in this domain.

2.5.1 The domain of attraction condition

The class $(C_0)$ is called the Gumbel class, as here the maxima are attracted to the Gumbel distribution function $\Lambda(x) := G_0(x) = \exp(-e^{-x})$. The domain of attraction is denoted by $\mathcal{D}(\Lambda)$.

Figure 2.8 Tails of distributions: different cases along the value of the EVI. (a) $\gamma > 0$, no upper bound, (b) $-1 < \gamma < 0$, finite endpoint $x_*$, zero density, (c) $\gamma = -1/k$ ($k \ge 3$, integer), zero density at $x_*$, first $(k - 2)$ derivatives of the density function are zero, (d) $\gamma = -1/2$, zero density, finite first derivative at $x_*$, (e) $-1 < \gamma < -1/2$, zero density at $x_*$, infinite first derivative, (f) $\gamma = -1$, non-zero finite density at $x_*$ and (g) $\gamma < -1$, infinite density at $x_*$.

The most central example is the exponential distribution with survival function $1 - F(x) = e^{-\lambda x}$, $x > 0$, with $\lambda > 0$. Then $Q(p) = -\frac{1}{\lambda}\log(1 - p)$, and so $U(x) = \frac{1}{\lambda}\log x$. Then

$$\frac{U(xu) - U(x)}{a(x)} = \frac{1}{\lambda}\,\frac{\log(xu) - \log(x)}{a(x)} = \frac{1}{\lambda\,a(x)}\,\log(u)$$

so that the constant function $a(x) = 1/\lambda$ leads to

$$\frac{U(xu) - U(x)}{a(x)} = \log(u),$$

and $(C_0)$ is satisfied.
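For the exponential distribution the normalized maximum has an explicit distribution, which makes the convergence to the Gumbel law easy to inspect. The sketch below uses $U(n) = \log(n)/\lambda$ and $a(n) = 1/\lambda$, so that the rate parameter cancels; the grid and the list of sample sizes are illustrative choices.

```python
import math

def cdf_normalized_max_exp(n, x):
    """P{(X_{n,n} - U(n))/a(n) <= x} for Exp(lam), with U(n) = log(n)/lam, a(n) = 1/lam."""
    if math.log(n) + x <= 0.0:           # the corresponding level for X_{n,n} is not positive
        return 0.0
    return (1.0 - math.exp(-x) / n) ** n

def gumbel_cdf(x):
    """Lambda(x) = exp(-exp(-x))."""
    return math.exp(-math.exp(-x))

grid = [-3.0 + 0.1 * k for k in range(101)]
# Largest absolute gap to the Gumbel law over the grid, per sample size n.
sup_gap = {n: max(abs(cdf_normalized_max_exp(n, x) - gumbel_cdf(x)) for x in grid)
           for n in (2, 5, 10, 100)}
```

The recorded gaps decrease with $n$, and the computation does not depend on the rate parameter because it is absorbed by the normalization.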

The convergence of $(X_{n,n} - U(n))/a(n)$ to the Gumbel limit is illustrated in Figure 2.9 for the Exp($\lambda$) distribution.

As opposed to the other two domains of attraction, the elements of $\mathcal{D}(\Lambda)$ cannot be considered as being of the same type as this prime example. Validating that all the other examples from Table 2.3 belong to $(C_0)$ can be a tedious job. We provide some alternative conditions for $\mathcal{D}(\Lambda)$ next.

Figure 2.9 Plot of $P\{(X_{n,n} - U(n))/a(n) \le x\} = \left(1 - \frac{\exp(-x)}{n}\right)^n$ for $n = 2$ (dotted line), $n = 5$ (broken-dotted line), $n = 10$ (broken line) and its limit, for $n \to \infty$, $\exp(-\exp(-x))$ (solid line).

Table 2.3 A list of distributions in the Gumbel domain.

Distribution     1 - F(x)                                                                                          Range and parameters
Benktander II    [expression not recoverable from the source]                                                     $x > 0$; $\alpha, \beta > 0$
Weibull          $\exp(-\lambda x^{\tau})$                                                                        $x > 0$; $\lambda, \tau > 0$
Exponential      $\exp(-\lambda x)$                                                                               $x > 0$; $\lambda > 0$
Gamma            $\frac{\lambda^m}{\Gamma(m)}\int_x^\infty u^{m-1}\exp(-\lambda u)\,du$                           $x > 0$; $\lambda, m > 0$
Logistic         $1/(1 + \exp(x))$                                                                                $x \in \mathbb{R}$
Log-normal       $\int_x^\infty \frac{1}{\sqrt{2\pi}\,\sigma u}\exp\left(-\frac{(\log u - \mu)^2}{2\sigma^2}\right)du$   $x > 0$; $\mu \in \mathbb{R}$, $\sigma > 0$


2.5.2 Condition on the underlying distribution 


The characterization of $\mathcal{D}(\Lambda)$ in terms of the distribution function is also more complex than in the other two cases. The pioneering thesis of de Haan (1970) gave a solution to this problem, revitalizing the interest in extreme value analysis.

Proposition 2.1 The distribution $F$ belongs to $\mathcal{D}(\Lambda)$ if and only if for some auxiliary function $b$, for every $v > 0$,

$$\frac{1 - F(y + b(y)v)}{1 - F(y)} \to e^{-v} \tag{2.12}$$

as $y \to x_*$. Then

$$\frac{b(y + v\,b(y))}{b(y)} \to 1.$$

Since this result is of a quite different nature than those for the Frechet-Pareto and the extreme value Weibull cases, the question can now be raised whether the formulation in (2.12) can be generalized to the general case of $(C_\gamma)$. This will be done in section 2.6. A sketch of proof for this general case will be given in section 2.9.4.


2.5.3 The historical approach and examples 

The case $\gamma = 0$ has been responsible for the seemingly awkward treatment of the three extreme value domains. If one tries to find the appropriate centering and norming constants for the most classical distribution from statistics, the normal distribution, the calculations are far from trivial.

The von Mises sufficiency condition is a bit more elaborate than before.

Proposition 2.2 Von Mises’ theorem If $r(x)$ is ultimately positive in the neighbourhood of $x_*$, is differentiable there and satisfies $\lim_{x \uparrow x_*} \frac{d}{dx}\frac{1}{r(x)} = 0$, then $F$ belongs to $\mathcal{D}(\Lambda)$.

The calculations involved in checking the attraction condition to $\Lambda$ are often tedious. In this respect, the von Mises criterion can be very handy, particularly as the Gumbel domain is very wide. This can be illustrated with the normal distribution as well as with the (classical) Weibull distribution $F(x) = 1 - e^{-x^{\alpha}}$ with $\alpha > 0$ and $x > 0$. Also, the logistic distribution has an explicit expression $F(x) = \{1 + \exp(-(x - a)/b)\}^{-1}$, which is easily shown to satisfy the von Mises condition. The calculations for the log-normal distribution are somewhat tedious.
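For the logistic distribution the check can be done by hand, since $1/r(x) = b\,(1 + e^{-(x - a)/b})$, whose derivative $-e^{-(x - a)/b}$ vanishes as $x \to \infty$. The sketch below confirms this numerically, using the classical form of the von Mises criterion in which the derivative of the reciprocal hazard $(1 - F)/f$ tends to zero; the finite-difference helper and the evaluation points are assumptions of the example.

```python
import math

def inverse_hazard(x, a=0.0, b=1.0):
    """1/r(x) = (1 - F(x))/f(x) for the logistic F(x) = 1/(1 + exp(-(x - a)/b))."""
    t = math.exp(-(x - a) / b)
    survival = t / (1.0 + t)
    density = t / (b * (1.0 + t) ** 2)
    return survival / density            # algebraically equal to b * (1 + t)

def derivative(fun, x, step=1e-5):
    """Central finite difference, used here in place of a symbolic derivative."""
    return (fun(x + step) - fun(x - step)) / (2.0 * step)

slopes = [derivative(inverse_hazard, x) for x in (0.0, 5.0, 10.0, 20.0)]
```

The finite-difference slopes shrink in absolute value towards zero as the evaluation point moves to the right, as the criterion requires.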

2.6 Alternative Conditions for $(C_\gamma)$

We return to the general case. In view of statistical issues for arbitrary values of the EVI $\gamma$, we need alternative conditions for the general domain of attraction condition $(C_\gamma)$. The proofs of the results in this section are probably among the most technical points in the discussion of the attraction problem for the maximum. Then again, we do not treat the most general case since we started out from the added restriction that $F$ is continuous. Proofs are deferred to the end of the chapter.

(/) A first and equivalent condition is given in terms of the distribution function. 
The result comes from de Haan (1970) and extends Proposition 2.1 for the case 
y = 0 to the general case. The derivation is postponed to section 2.9.4. 

Proposition 2.1 The distribution $F$ belongs to $\mathcal{D}(G_\gamma)$ if and only if for some
auxiliary function $b$ and $1 + \gamma v > 0$

$$\frac{1 - F(y + b(y)v)}{1 - F(y)} \to (1 + \gamma v)^{-1/\gamma} \qquad (2.13)$$

as $y \uparrow x^*$. Then

$$\frac{b(y + v b(y))}{b(y)} \to u^{\gamma} = 1 + \gamma v.$$

As will be shown in section 2.9.4, the auxiliary function $b$ can be taken as $b(y) = a(U^{\leftarrow}(y))$.

Another equivalent condition is closely linked to the above. Instead of letting
$y \uparrow x^*$ in an arbitrary fashion, we can restrict $y$ by putting $1 - F(y) = n^{-1}$ or
equivalently $y = U(n)$.

Proposition 2.2 The distribution $F$ satisfies $(C_\gamma)$ if and only if

$$n\{1 - F(U(n) + b_n v)\} \to H(v) \qquad (2.14)$$

for a positive sequence $b_n$ and a positive, non-constant function $H$.

As before, the mere existence of the limit is enough for the explicit form
$(1 + \gamma v)^{-1/\gamma}$ for a real $\gamma$.

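(ii) Here is a first necessary condition for $(C_\gamma)$ that will be crucial in the
statistical chapters. The condition $(C_\gamma)$ entails that for $x \to \infty$

$$\frac{U(x)}{a(x)}\{\log U(xu) - \log U(x)\} \to \begin{cases} \log u, & \text{if } \gamma \ge 0, \\ \dfrac{u^{\gamma} - 1}{\gamma}, & \text{if } \gamma < 0, \end{cases} \qquad (\tilde{C}_\gamma)$$

provided $x^* > 0$.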

(iii) The relationship between $U$ and $a$ as appearing in conditions $(C_\gamma)$ and $(\tilde{C}_\gamma)$
is different in the three cases. This becomes clear from the following result.

Theorem 2.3 Let $(C_\gamma)$ hold.

(i) Fréchet-Pareto case: $\gamma > 0$. The ratio $a(x)/U(x) \to \gamma$ as $x \to \infty$ and $U$ is
of the same regular variation as the auxiliary function $a$; moreover, $(C_\gamma)$ is
equivalent with the existence of an s.v. function $\ell_U$ for which $U(x) = x^{\gamma}\ell_U(x)$.

(ii) Gumbel case: $\gamma = 0$. The ratios $a(x)/U(x) \to 0$ and $a(x)/\{x^* - U(x)\} \to 0$
when $x^*$ is finite.

(iii) (Extremal) Weibull case: $\gamma < 0$. Here $x^*$ is finite, the ratio $a(x)/\{x^* - U(x)\} \to -\gamma$
and $\{x^* - U(x)\}$ is of the same regular variation as the auxiliary function
$a$; moreover, $(C_\gamma)$ is equivalent with the existence of an s.v. function $\ell_U$ for
which $x^* - U(x) = x^{\gamma}\ell_U(x)$.

However, the function $a$ can be linked to the mean excess function of the log-
transformed data over all cases. We first consider the case where $\gamma > 0$. In a later
section, we will see how Hill's estimator can be motivated from the fact that

$$e_{\log X}(\log x) := E\left(\log\frac{X}{x} \,\Big|\, X > x\right) = \int_x^{\infty} \frac{\log(u/x)}{1 - F(x)}\, dF(u) = \int_x^{\infty} \frac{1 - F(u)}{1 - F(x)}\, \frac{du}{u} = \int_1^{\infty} \frac{1 - F(xy)}{1 - F(x)}\, \frac{dy}{y}$$

for $x \to \infty$ tends to $\int_1^{\infty} y^{-1/\gamma - 1}\, dy = \gamma$ (in Theorem 2.3 we show why the limit and
integral can be interchanged).
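
The limit above can be mimicked empirically: replacing the conditional expectation by an average of log-excesses over a high order statistic gives an estimate that should be close to $\gamma$. The following is a minimal sketch under the assumption of strict Pareto data with $1 - F(x) = x^{-1/\gamma}$ for $x \ge 1$; the function name `hill`, the sample size and the choice of `k` are illustrative only.

```python
import math
import random

def hill(sample, k):
    """Average log-excess of the k largest observations over the (k+1)-th largest."""
    xs = sorted(sample, reverse=True)
    threshold = xs[k]
    return sum(math.log(x / threshold) for x in xs[:k]) / k

random.seed(1)
gamma = 0.5
# Strict Pareto data: X = V**(-gamma) with V uniform on (0, 1].
sample = [(1.0 - random.random()) ** (-gamma) for _ in range(20000)]
estimate = hill(sample, 2000)
print(estimate)  # should be close to gamma
```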
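
In the sequel, we will show that under $(C_\gamma)$

$$\frac{U(x)\, e_{\log X}(\log U(x))}{a(x)} \to \begin{cases} 1, & \text{if } \gamma \ge 0, \\ \displaystyle\int_1^{\infty} \frac{u^{\gamma} - 1}{\gamma}\, \frac{du}{u^2} = \frac{1}{1 - \gamma}, & \text{if } \gamma < 0, \end{cases} \qquad (2.15)$$

provided $x^* > 0$, when $x \to \infty$. It then follows that the function $b$ appearing in $(C^*_\gamma)$ can be taken as

$$b(t) = (1 - \gamma^{-})\, t\, e_{\log X}(\log t) \qquad (2.16)$$

where

$$\gamma^{-} = \begin{cases} 0, & \text{if } \gamma \ge 0, \\ \gamma, & \text{if } \gamma < 0. \end{cases}$$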

2.7 Further on the Historical Approach 

As illustrated in the previous sections, the literature offers other forms that are more 
customary as they refer to the distribution F rather than to the tail quantile function 
U. We feel the need to formulate the result in its historic form. We include the 
domain of attraction conditions as well. The next theorem contains the historical 
results derived by Fisher and Tippett (1928) and Gnedenko (1943). 

Distributions that differ only in location and scale are called of the same type. 


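Theorem 2.1 Fisher-Tippett-Gnedenko Theorem. The extremal laws are exactly
those that agree to within type with one of the following ($\alpha > 0$)

(i) Fréchet-Pareto-type:

$$\Phi_\alpha(x) = \begin{cases} 0, & \text{if } x \le 0, \\ \exp(-x^{-\alpha}), & \text{if } x > 0. \end{cases}$$

Moreover, $F \in \mathcal{D}(\Phi_\alpha)$ if and only if for $x \to \infty$

$$\frac{1 - F(\lambda x)}{1 - F(x)} \to \lambda^{-\alpha}, \quad \text{for all } \lambda > 0.$$

(ii) Gumbel-type:

$$\Lambda(x) = \exp(-e^{-x}), \quad x \in \mathbb{R}.$$

Moreover, $F \in \mathcal{D}(\Lambda)$ if and only if for some auxiliary function $b$, for $x \uparrow x^*$,

$$\frac{1 - F(x + t b(x))}{1 - F(x)} \to e^{-t}, \quad \text{for all } t > 0.$$

(iii) (Extremal) Weibull-type:

$$\Psi_\alpha(x) = \begin{cases} \exp(-|x|^{\alpha}), & \text{if } x \le 0, \\ 1, & \text{if } x > 0. \end{cases}$$

Moreover, $F \in \mathcal{D}(\Psi_\alpha)$ if and only if $x^* < \infty$ and for $x \to \infty$

$$\frac{1 - F\left(x^* - \frac{1}{\lambda x}\right)}{1 - F\left(x^* - \frac{1}{x}\right)} \to \lambda^{-\alpha}, \quad \text{for all } \lambda > 0.$$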


2.8 Summary 

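We summarize the most important results of this chapter concerning conditions on
$1 - F$ or the tail quantile function $U(y) = Q(1 - 1/y)$ for a distribution to belong
to the maximum domain of attraction of an extreme value distribution.

General case, $\gamma \in \mathbb{R}$

  Conditions on $U$:

    $(C_\gamma)$: $\lim_{x \to \infty} \dfrac{U(xu) - U(x)}{a(x)} = h_\gamma(u)$

    $(\tilde{C}_\gamma)$: $\lim_{x \to \infty} \dfrac{U(x)}{a(x)} \log\dfrac{U(xu)}{U(x)} = \begin{cases} \log u, & \text{if } \gamma \ge 0, \\ \dfrac{u^{\gamma} - 1}{\gamma}, & \text{if } \gamma < 0, \end{cases}$
    where $a(x)$ can be taken as $(1 - \gamma^{-})\, U(x)\, e_{\log X}(\log U(x))$

  Conditions on $1 - F$:

    $(C^*_\gamma)$: $\lim_{y \uparrow x^*} \dfrac{1 - F(y + b(y)v)}{1 - F(y)} = \eta_\gamma(v)$
    where $b(y)$ can be taken as $(1 - \gamma^{-})\, y\, e_{\log X}(\log y)$

Pareto-type distributions, $\gamma > 0$

  Condition on $U$: $\lim_{x \to \infty} \dfrac{U(xu)}{U(x)} = u^{\gamma}$, $u > 0$

  Condition on $1 - F$: $\lim_{x \to \infty} \dfrac{1 - F(xv)}{1 - F(x)} = v^{-1/\gamma}$, $v > 0$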

2.9 Background Information 

In this section, we collect a number of results that provide most of the background 
necessary to fully understand the mathematical derivations. Some of these results 
are proved while for the proofs of other statements, we refer the reader to the 
literature. We begin with information on inverse functions that is useful to see 
what kind of conditions we are actually imposing on the underlying distributions.
We then collect information on functions of regular variation; here Bingham et al.
(1987) is an appropriate reference. In the last part, we return to alternative forms
of the condition $(C_\gamma)$ and the relation to the underlying distribution $F$.

2.9.1 Inverse of a distribution 

Start with a general distribution $F$. Denote by ${}^*x := \inf\{x : F(x) > 0\}$ the left-end
boundary of $F$ while similarly $x^* := \sup\{x : F(x) < 1\}$. We define the tail quantile
function $U$ by $U(t) := \inf\{x : F(x) \ge 1 - 1/t\}$. Note that the tail quantile function
$U$ and the quantile function $Q$ are linked via the relation $U(t) = Q(1 - 1/t)$.
From this definition, we get the following inequalities:

(i) if $z < U(t)$, then $1 - F(z) > 1/t$;

(ii) for all $t > 0$, $1 - F(U(t)) \le 1/t$;

(iii) for all $x < x^*$, $U(1/(1 - F(x))) \le x$.

Note that $U(1) = {}^*x$ while $U(\infty) = x^*$. It is easy to prove that if $F$ is continuous,
then the equality $1 - F(U(t)) = 1/t$ is valid while if $U$ is continuous,
$U(1/(1 - F(x))) = x$. In any case, $U$ will be left-continuous while $F$ is right-
continuous. The easy transition from $F$ to $U$ and back is the main reason for
always assuming that $F$ and $U$ are continuous.

Even more special is the important case where $F$ has a strictly positive derivative
$f$ on its domain of definition. For then

$$1 - F(U(t)) = \int_{U(t)}^{x^*} f(y)\, dy = \frac{1}{t}.$$

But then also $U$ has a derivative $u$ and both derivatives are linked by the equation

$$f(U(t))\, u(t) = t^{-2} \qquad (2.17)$$

which could help in calculating $u$ or $U$ if $F$ is known through its density function $f$.
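
The definition of the tail quantile function and relation (2.17) can be checked numerically. The sketch below computes $U(t) = \inf\{x : F(x) \ge 1 - 1/t\}$ by bisection and approximates the derivative $u$ by a central difference. The standard exponential distribution, the bracketing interval and the helper name `tail_quantile` are assumptions chosen for illustration; for this distribution $U(t) = \log t$ in closed form.

```python
import math

def tail_quantile(F, t, lo, hi, tol=1e-12):
    """U(t) = inf{x : F(x) >= 1 - 1/t} for a continuous increasing F on [lo, hi]."""
    target = 1.0 - 1.0 / t
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

F = lambda x: 1.0 - math.exp(-x)   # standard exponential distribution function
f = lambda x: math.exp(-x)         # its density

t = 50.0
U_t = tail_quantile(F, t, 0.0, 100.0)
step = 1e-3
u_t = (tail_quantile(F, t + step, 0.0, 100.0)
       - tail_quantile(F, t - step, 0.0, 100.0)) / (2.0 * step)

print(U_t, math.log(t))        # U(t) against the closed form log t
print(f(U_t) * u_t, t ** -2)   # both sides of (2.17)
```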

2.9.2 Functions of regular variation 

In this section, we treat a class of functions that shows up in a vast number of
applications in the whole of mathematics and that is intimately related to the class of
power functions. We first give some generalities. Then we state a number of
fundamental properties. We continue with properties that are particularly important for us.

Definition 2.1 Let $f$ be an ultimately positive and measurable function on $\mathbb{R}_+$. We
will say that $f$ is regularly varying if and only if there exists a real constant $\rho$
for which

$$\lim_{x \uparrow \infty} \frac{f(xt)}{f(x)} = t^{\rho} \quad \text{for all } t > 0. \qquad (2.18)$$

We write $f \in \mathcal{R}_\rho$ and we call $\rho$ the index of regular variation. In the case $\rho = 0$,
the function will be called slowly varying (s.v.) or of slow variation. We will reserve
the symbol $\ell$ for such functions.

The class of all regularly varying functions is denoted by $\mathcal{R}$.

It is easy to give examples of s.v. functions. Typical examples are

• $\ell(x) = (\log x)^{\alpha}_{+}$ for arbitrary $\alpha$. Further on, we will always drop the + sign;
regular variation indeed is an asymptotic concept and hence does not depend
on what happens at fixed values.

• $\ell(x) = \prod_{k} (\log_k x)^{\alpha_k}$ where $\log_1 x = \log x$ while for $n \ge 1$, $\log_{n+1} x := \log(\log_n x)$.

• $\ell$ satisfying $\ell(x) \to c \in (0, \infty)$.

• $\ell(x) = \exp\{(\log x)^{\beta}\}$ where $\beta < 1$.

The class $\mathcal{R}_0$ has a lot of properties that will be constantly used. Some of the
proofs are easy. For others, we refer to the literature.

Proposition 2.2 Slowly varying functions have the following properties:

(i) $\mathcal{R}_0$ is closed under addition, multiplication and division.

(ii) If $\ell$ is s.v. then $\ell^{\alpha}$ is s.v. for all $\alpha \in \mathbb{R}$.

(iii) If $\rho \in \mathbb{R}$, then $f \in \mathcal{R}_\rho$ iff $f^{-1} \in \mathcal{R}_{-\rho}$.

Mathematically, the two most important results about functions in $\mathcal{R}_0$ are given in
the following theorem due to Karamata.

Theorem 2.3

(i) Uniform Convergence Theorem. If $\ell \in \mathcal{R}_0$, then the convergence

$$\lim_{x \uparrow \infty} \frac{\ell(xt)}{\ell(x)} = 1$$

is uniform for $t \in [a, b]$ where $0 < a < b < \infty$.

(ii) Representation Theorem. $\ell \in \mathcal{R}_0$ if and only if it can be represented in the
form

$$\ell(x) = c(x) \exp\left\{\int_1^x \frac{\varepsilon(u)}{u}\, du\right\}$$

where $c(x) \to c \in (0, \infty)$ and $\varepsilon(x) \to 0$ as $x \to \infty$.

The second part of the theorem above allows direct construction of elements of
$\mathcal{R}_0$. For later reference, we collect some more properties in the next proposition.


Proposition 2.4 Let $\ell$ be slowly varying. Then

(i)

$$\lim_{x \uparrow \infty} \frac{\log \ell(x)}{\log x} = 0. \qquad (2.19)$$

(ii) For each $\delta > 0$ there exists an $x_\delta$ so that for all constants $A > 0$ and $x > x_\delta$

$$A x^{-\delta} < \ell(x) < A x^{\delta}.$$

(iii) If $f \in \mathcal{R}_\rho$ with $\rho > 0$, then $f(x) \to \infty$, while for $\rho < 0$, $f(x) \to 0$
as $x \to \infty$.

(iv) Potter Bounds. Given $A > 1$ and $\delta > 0$ there exists a constant $x_0(A, \delta)$ such
that

$$\frac{\ell(y)}{\ell(x)} \le A \max\left\{\left(\frac{y}{x}\right)^{\delta}, \left(\frac{x}{y}\right)^{\delta}\right\}, \quad x, y \ge x_0.$$

Statements (i) and (ii) follow easily from the representation theorem. Unfortunately,
(ii) does not characterize slow variation. For (iii) and $\rho > 0$, use (ii) to
see that for $x$ large enough $f(x) = x^{\rho}\ell(x) \ge x^{\rho/2}$, which tends to $\infty$ with $x$.
Relation (iii) shows that regularly varying functions with $\rho \ne 0$ are akin to monotone
functions. Further, Potter bounds are often employed when estimating integrals
with slowly varying functions in the integrand.


2.9.3 Relation between F and U 

The link between the tail of the distribution $F$ and its tail quantile function $U$
depends on an inversion result from the theory of s.v. functions. We introduce the
concept of the de Bruyn conjugate whose existence and asymptotic uniqueness is
guaranteed by the following result.

Proposition 2.5 If $\ell(x)$ is s.v., then there exists an s.v. function $\ell^*(x)$, the de Bruyn
conjugate of $\ell$, such that

$$\ell(x)\, \ell^*(x\ell(x)) \to 1, \quad x \uparrow \infty.$$

The de Bruyn conjugate is asymptotically unique in the sense that if also $\tilde{\ell}$ is s.v.
and $\ell(x)\, \tilde{\ell}(x\ell(x)) \to 1$, then $\ell^* \sim \tilde{\ell}$. Furthermore $(\ell^*)^* \sim \ell$.

As a simple example, check that if $\ell = \log$, then $\ell^* \sim 1/\log$. Let us illustrate
how (2.7) can be obtained from this proposition. Put $y := 1/(1 - F(x))$. Then
$1 - F(x) = x^{-\alpha}\ell_F(x)$ translates into

$$y = x^{\alpha}\, \ell_F^{-1}(x) = (x\ell(x))^{\alpha}$$

where

$$\ell(x) := \ell_F^{-1/\alpha}(x).$$

By Proposition 2.5, one can solve the equation $y^{1/\alpha} = x\ell(x)$ for $x$ yielding
$x = y^{1/\alpha}\ell^*(y^{1/\alpha})$ where $\ell^*$ is the de Bruyn conjugate of $\ell$. The direct connection
between the two functions $F$ and $U$ is given by $U(1/(1 - F(x))) \sim x$ (check
this!), or by

$$x \sim U(y) = y^{\gamma}\ell_U(y) = y^{1/\alpha}\ell^*(y^{1/\alpha}).$$

This yields that indeed $\gamma = \alpha^{-1}$ and that $\ell_U(x) \sim \ell^*(x^{\gamma})$. We see how the link
between $\ell_F$ and $\ell_U$ runs through $\ell$ and its de Bruyn conjugate $\ell^*$. Note that all
the previous steps can be reversed so that there is full equivalence between the
statements.

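Example. Assume that $\ell_F(x) = (\log x)^{\beta}$; then $\ell(x) = (\log x)^{-\beta/\alpha} = (\log x)^{-\beta\gamma}$
and we need to solve the equation $y = (x\ell(x))^{\alpha}$ for $x$. This means that

$$y^{1/\alpha} = x(\log x)^{-\beta\gamma} \sim y^{1/\alpha}\, \ell^*(y^{1/\alpha}) \left\{\log y^{1/\alpha} + \log \ell^*(y^{1/\alpha})\right\}^{-\beta\gamma}.$$

Writing $y$ for $y^{1/\alpha}$, this in turn leads to

$$1 \sim \ell^*(y)\, (\log y)^{-\beta\gamma} \left(1 + \frac{\log \ell^*(y)}{\log y}\right)^{-\beta\gamma}.$$

Now use (2.19) to deduce that $\ell^*(y) \sim (\log y)^{\beta\gamma}$ and hence that $\ell_U(x) \sim (\gamma \log x)^{\beta\gamma}$.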
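
The defining property of the de Bruyn conjugate can be inspected numerically for the simple pair $\ell = \log$ and $\ell^* = 1/\log$ mentioned earlier. The snippet below is only a sanity check with illustrative function names; it evaluates the product $\ell(x)\,\ell^*(x\ell(x))$ at a few very large arguments and shows how slowly it approaches 1.

```python
import math

def ell(x):
    """The slowly varying function log."""
    return math.log(x)

def ell_star(x):
    """Candidate de Bruyn conjugate of log, namely 1/log."""
    return 1.0 / math.log(x)

def conjugate_product(x):
    """ell(x) * ell_star(x * ell(x)), which should tend to 1 as x grows."""
    return ell(x) * ell_star(x * ell(x))

for exponent in (10, 100, 300):
    print(exponent, conjugate_product(10.0 ** exponent))
```

The product equals $\log x / (\log x + \log\log x)$, so the error decays only like $\log\log x / \log x$.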


2.9.4 Proofs for section 2.6 

We start with the proof of Proposition 2.1. To this end, define $k(x, u)$ by the
relation $U(ux) - U(x) = k(x, u)a(x)$ where $u > 0$ has been fixed. By definition,
the inverse function $Q(1 - \frac{1}{ux})$ equals $U(x) + k(x, u)a(x)$ where $k(x, u) \to h_\gamma(u)$
as $x \uparrow \infty$. But then

$$\frac{1}{ux} = 1 - F(U(x) + k(x, u)a(x))$$

while at the same time $1/x = 1 - F(U(x))$. Put $y = U(x)$. We find that

$$\frac{1}{u} = \frac{1 - F\left(y + \tilde{k}(y, u)\, a(U^{\leftarrow}(y))\right)}{1 - F(y)}$$

where we have put $\tilde{k}(y, u) = k(U^{\leftarrow}(y), u) = k(x, u)$. Replace $a(U^{\leftarrow}(y))$ by $b(y)$.
Further change $u$ into $v$ by the relation $v = h_\gamma(u)$. Equivalently, by (2.5), $u = (1 + \gamma v)^{1/\gamma}$;
for $\gamma = 0$, $v = \log u$ immediately leads to $u = \exp v$. Replace this $u$ by $v$ in
$\tilde{k}(y, u)$ to obtain $\bar{k}(y, v)$. We have therefore transformed the original equation into

$$\frac{1 - F\left(y + b(y)\bar{k}(y, v)\right)}{1 - F(y)} = (1 + \gamma v)^{-1/\gamma} = \eta_\gamma(v).$$

However, when $x \to \infty$, $k(x, u) \to h_\gamma(u)$ translates into $y \uparrow x^*$ and $\bar{k}(y, v) \to v$.
By the monotonicity of $F$ we see that then also

$$\frac{1 - F(y + b(y)v)}{1 - F(y)} \to (1 + \gamma v)^{-1/\gamma}.$$

If also $U$ is continuous, then the converse derivation holds as well.
Note that the substitutions $y = U(x)$ and $v = h_\gamma(u)$ yield

$$\frac{b(y + vb(y))}{b(y)} = \frac{a(U^{\leftarrow}(y + vb(y)))}{a(U^{\leftarrow}(y))} = \frac{a(U^{\leftarrow}(U(x) + vb(U(x))))}{a(x)} \to u^{\gamma} = 1 + \gamma v. \qquad (2.20)$$

Indeed, the last step follows again from the basic relation $(C_\gamma)$ since for $x \to \infty$

$$U(x) + vb(U(x)) = U(xu) + (v - k(x, u))a(x) = U(xu) + o(1)a(x) = U(xu)(1 + o(1))$$

together with the uniform convergence theorem 2.3(i).

The statement of Proposition 2.2 needs no proof in one direction as the choice
of $y$ provides a necessary condition immediately. That the condition is also sufficient
is less obvious and needs interpolation type arguments as can be found in de
Haan (1970).

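To prove the statement of (ii) recall that when $x \to 1$, then $\log x \sim x - 1$. If
$\gamma \le 0$, then $U(xu)/U(x) \to 1$ and hence $\log(U(xu)/U(x)) \sim U(xu)/U(x) - 1$.
Multiplying by $U(x)/a(x)$, we can use $(C_\gamma)$ to see that

$$\frac{U(x)}{a(x)}\{\log U(xu) - \log U(x)\} \sim \frac{U(xu) - U(x)}{a(x)} \to h_\gamma(u) = \frac{u^{\gamma} - 1}{\gamma} \qquad (2.21)$$

as $x \to \infty$. In case $\gamma > 0$, the quantity on the left tends to $\log u$ since
$U(x)/a(x) \to \gamma^{-1}$ while $U(ux)/U(x) \to u^{\gamma}$. Combination of the two statements
leads to the condition for $x \to \infty$

$$\frac{U(x)}{a(x)}\{\log U(xu) - \log U(x)\} \to \begin{cases} \log u, & \text{if } \gamma \ge 0, \\ \dfrac{u^{\gamma} - 1}{\gamma}, & \text{if } \gamma < 0, \end{cases} \qquad (\tilde{C}_\gamma)$$

provided $x^* > 0$.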

(iii) In case $\gamma > 0$, $U(x)/a(x) \to 1/\gamma$. Also, in section 2.6, it was outlined that
when $\gamma > 0$, $e_{\log X}(\log x) \to \gamma$ and hence $e_{\log X}(\log U(x)) \to \gamma$ as $x \to \infty$. Indeed,
limit and integral can be interchanged there using the dominated convergence
theorem based on the Potter bounds 2.4(iv). So we find indeed that in this case

$$\frac{U(x)}{a(x)}\, e_{\log X}(\log U(x)) \to 1.$$

On the other hand, when $\gamma < 0$, we have

$$e_{\log X}(\log U(x)) := \int_{U(x)}^{x^*} \frac{\log(y/U(x))}{(1 - F)(U(x))}\, dF(y) = -x \int_{U(x)}^{x^*} \log\frac{y}{U(x)}\, d(1 - F(y)) = x \int_x^{\infty} \log\frac{U(v)}{U(x)}\, \frac{dv}{v^2} = \int_1^{\infty} (\log U(wx) - \log U(x))\, \frac{dw}{w^2}$$

where a first substitution $1 - F(y) = \frac{1}{v}$ was followed by a second $v = wx$.

Now, when $\gamma < 0$, apply (2.21) to see that under $(C_\gamma)$ the requested result
follows. This final limit can be taken using the dominated convergence theorem
with the help of Potter type bounds on (2.21), which can be found, for instance,
in Dekkers et al. (1989).



3 

AWAY FROM THE MAXIMUM 


3.1 Introduction 

We start with a simple observation. It would be unrealistic to assume that only the 
maximum of a sample contains valuable information about the tail of a distribution. 
Other large order statistics could do this as well. In this chapter, we investigate how 
far we can move away from the maximum. If we stay close to the maximum, then 
only few order statistics are used and our estimators will show large variances. 
Running away from the maximum will decrease the variation as the number of 
useful order statistics increases; however, as a side result, the bias will become 
larger. If we plan to use the subset $\{X_{n-k+1,n}, X_{n-k+2,n}, \ldots, X_{n,n}\}$ of the order
sample values

$$X_{1,n} \le X_{2,n} \le \cdots \le X_{n-1,n} \le X_{n,n}$$

then we need to find out how to choose the rank $k$ in the order statistic $X_{n-k+1,n}$
optimally. What is clear is that $k$ should be allowed to tend to $\infty$ together with
the sample size. But whether $k/n$ needs to be kept small is not so obvious.

In this chapter, we offer an intuitive guideline that is developed from a closer 
look at the asymptotics of the larger order statistics. Again, for a practitioner, it 
would look awkward to only use the largest value in a sample to estimate tail 
quantities, especially as this value may look so large that it seems hardly related 
to the sample. From this analysis, it will follow that the choice of the number of 
useful larger order statistics will also depend on the second-order behaviour in the 
relations $(C_\gamma)$ and $(C^*_\gamma)$. These will be developed in this chapter too. Mathematical
details are deferred to a final section.

3.2 Order Statistics Close to the Maximum

There is a great variety of possible limit laws for the individual order statistics. If
the index $k$ is small, we can expect that the limit behaviour of $X_{n-k+1,n}$ is similar
to that of the maximum $X_{n,n}$. We derive the corresponding theorems first.

To do anything explicit with the $k$-th largest order statistic, we need its distribution.
This can be rather easily obtained from a combinatorial argument. We look
for the probability that $X_{n-k+1,n}$ does not overshoot the value $x$. To do that, take
any of the $n$ elements from the sample and force it to have a value $u$ at most $x$.
The remaining $n - 1$ values have to be distributed binomially so that $k - 1$ of the
other sample values lie to the right of $u$, while the remaining $n - k$ values sit to
the left of $u$. Therefore,

$$P\{X_{n-k+1,n} \le x\} = \frac{n!}{(k-1)!\,(n-k)!} \int_{-\infty}^{x} F^{n-k}(u)\{1 - F(u)\}^{k-1}\, dF(u). \qquad (3.1)$$
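
Formula (3.1) can be cross-checked against a direct counting argument: the $k$-th largest observation is at most $x$ exactly when at most $k - 1$ of the $n$ observations exceed $x$. The sketch below compares the resulting binomial sum with a midpoint-rule evaluation of the integral in (3.1) after the substitution $w = F(u)$. The function names, the grid size and the test values of $n$, $k$ and $F(x)$ are illustrative assumptions.

```python
import math

def cdf_kth_largest_binomial(p, n, k):
    """P(X_{n-k+1,n} <= x) with p = F(x): at most k-1 of the n observations exceed x."""
    return sum(math.comb(n, j) * (1.0 - p) ** j * p ** (n - j) for j in range(k))

def cdf_kth_largest_integral(p, n, k, steps=20000):
    """Right-hand side of (3.1) after substituting w = F(u), using the midpoint rule."""
    const = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))
    width = p / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * width
        total += w ** (n - k) * (1.0 - w) ** (k - 1)
    return const * width * total

binom_value = cdf_kth_largest_binomial(0.9, 20, 3)
integral_value = cdf_kth_largest_integral(0.9, 20, 3)
print(binom_value, integral_value)
```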


Case 1: k fixed 


We follow the procedure that gave us the solution of the extremal limit problem. We
again use Theorem 2.1 and take $z$, a real, bounded and continuous function. Then

$$E\{z(a_n^{-1}(X_{n-k+1,n} - b_n))\} = \frac{n!}{(k-1)!\,(n-k)!} \int_{-\infty}^{\infty} z\left(\frac{x - b_n}{a_n}\right) F^{n-k}(x)\{1 - F(x)\}^{k-1}\, dF(x).$$

The substitution $F(x) = 1 - \frac{u}{n}$ turns this into the form

$$E\{z(a_n^{-1}(X_{n-k+1,n} - b_n))\} = \frac{1}{(k-1)!}\, \frac{n!\, n^{-k}}{(n-k)!} \int_0^{n} z\left(a_n^{-1}(U(n/u) - b_n)\right) u^{k-1} \left(1 - \frac{u}{n}\right)^{n-k} du.$$

From what we learnt about the maximum, where $k = 1$, the argument of $z$ does
not depend upon $k$. Therefore, we still can take $b_n = U(n)$ and $a_n = a(n)$ as in
the case of the maximum. Without much trouble, we see that if $F$ satisfies $C_\gamma(a)$,
then, as $n \to \infty$

$$E\{z(a_n^{-1}(X_{n-k+1,n} - b_n))\} \to \frac{1}{(k-1)!} \int_0^{\infty} z\left(h_\gamma\left(\frac{1}{u}\right)\right) e^{-u} u^{k-1}\, du \qquad (3.2)$$

since for large $n$ and fixed $k$, $n!/(n-k)! \sim n^{k}$. We can interpret the right-hand
side as an expectation with respect to a gamma density. If we denote by $\{E_j\}_{j=1}^{\infty}$
a sequence of i.i.d. exponential random variables with mean 1, then (3.2) can be
written in the form

$$\frac{X_{n-k+1,n} - U(n)}{a(n)} \xrightarrow{D} h_\gamma\left(\frac{1}{\sum_{j=1}^{k} E_j}\right). \qquad (3.3)$$
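
A partial sanity check of (3.3) is possible in the standard exponential case, where $U(n) = \log n$, $a(n) = 1$ and $\gamma = 0$, so that the limit is $-\log(E_1 + \cdots + E_k)$. The sketch below compares means only, and it relies on two facts that are assumptions brought in from outside the text: the Rényi representation of exponential order statistics, which gives $E[X_{n-k+1,n}] = \sum_{j=k}^{n} 1/j$, and the identity $E[\log \Gamma_k] = \psi(k) = H_{k-1} - \gamma_E$ for a gamma variable with shape $k$, with $\gamma_E$ Euler's constant. Agreement of means is consistent with (3.3) but does not by itself prove convergence in distribution.

```python
import math

EULER_GAMMA = 0.5772156649015329

def mean_centered_order_stat(n, k):
    """E[X_{n-k+1,n}] - log n for standard exponential data (Renyi representation)."""
    return sum(1.0 / j for j in range(k, n + 1)) - math.log(n)

def mean_of_limit(k):
    """E[-log(E_1 + ... + E_k)] = -digamma(k) = Euler's constant - H_{k-1}."""
    return EULER_GAMMA - sum(1.0 / j for j in range(1, k))

for k in (1, 2, 5):
    print(k, mean_centered_order_stat(10 ** 5, k), mean_of_limit(k))
```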

Case 2: $k \to \infty$, $n - k \to \infty$


In what follows, we assume that $k$, $n - k$ and therefore $n$ all tend to $\infty$. Of course,
at this point, we do not know in advance what kind of centering and normalization
should be used. So, take a centering sequence $\{b_n\}$ of real numbers and
a normalizing sequence $\{a_n\}$ of positive reals for which $a_n^{-1}(X_{n-k+1,n} - b_n)$ will
converge in distribution. Recall the formula (3.1) and let $z$ again be any real-valued
bounded and continuous function on $\mathbb{R}$. Then we want to investigate the limiting
behaviour of

$$Z_n := E\left\{z\left(\frac{X_{n-k+1,n} - b_n}{a_n}\right)\right\} = \frac{n!}{(k-1)!\,(n-k)!} \int_{-\infty}^{\infty} z\left(\frac{x - b_n}{a_n}\right) F^{n-k}(x)\{1 - F(x)\}^{k-1}\, dF(x)$$

where $\{a_n\}$ and $\{b_n\}$ are norming, respectively centering constants.

As happened in the previous analysis, it is not surprising that, apart from the
argument of $z$, all other ingredients can be controlled, regardless of the properties of
the underlying distribution. This is done by changing $x$ into $v$ by the transformation

$$1 - F(x) = \frac{k}{n} + \sqrt{\frac{k(n-k)}{n^3}}\, v. \qquad (3.4)$$

Why this specific transformation works is fully explained in section 3.4.1. If we
solve the above equation for $x$ and replace $x = x(k, n, v)$ in the expression $Z_n$,
then all we have to do is to find appropriate choices of $a_n$ and $b_n$ such that

$$r_n(v) := \frac{x(k, n, v) - b_n}{a_n} \to r(v) \qquad (3.5)$$

for $v$ bounded. Indeed, as shown in section 3.4.1, we have the following covering
result.


Proposition 3.1 If (3.5) holds for $v$ bounded, then

$$E\left\{z\left(\frac{X_{n-k+1,n} - b_n}{a_n}\right)\right\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z(r(v))\, e^{-v^2/2}\, dv. \qquad (3.6)$$

Since the sum $\sum_{j=1}^{k} E_j$ in (3.3) will be ultimately approximated by a normal
variable when $k \uparrow \infty$, the appearance of the normal distribution should not be
surprising in view of the central limit theorem.

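The explicit value of $x(k, n, v)$ is easily obtained from (3.4) and is

$$x(k, n, v) = U\left(\frac{1}{\frac{k}{n} + \sqrt{\frac{k(n-k)}{n^3}}\, v}\right) = U\left(\frac{n}{k}\left(1 + \sqrt{\frac{n-k}{kn}}\, v\right)^{-1}\right).$$
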


The choice of $b_n$ in (3.5) is almost automatic, since 0 is a bounded value for $v$.
Therefore, we put $b_n = x(k, n, 0) = U(n/k)$. The choice of $a_n$ can then be made
by requiring the convergence of

$$r_n(v) = \frac{U\left(\frac{n}{k}\left(1 + \sqrt{\frac{n-k}{kn}}\, v\right)^{-1}\right) - U\left(\frac{n}{k}\right)}{a_n} \to r(v)$$

for $v$ bounded. That the choice of $a_n$ has to be determined by the limiting behaviour
of the ratio $k/n$ will then become clear. Here are a few illustrations on how
conditions on the distribution $F$ or on the tail quantile function influence the
possible choice of $k$ and as such of $a_n$.


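(i) We immediately get a weak law. Assume $k/n \to 0$. If $F$ satisfies $C_\gamma(a)$, then
the choice $a_n = a(n/k)$ yields

$$r_n(v) = \frac{U\left(\frac{n}{k}\left(1 + \sqrt{\frac{n-k}{kn}}\, v\right)^{-1}\right) - U\left(\frac{n}{k}\right)}{a\left(\frac{n}{k}\right)} \to 0 = r(v).$$

Hence $Z_n \to z(0)$ and

$$\frac{X_{n-k+1,n} - U\left(\frac{n}{k}\right)}{a\left(\frac{n}{k}\right)} \xrightarrow{P} 0.$$

In the case $\gamma > 0$, we can go one step further. Indeed, since $U(x)/a(x) \to \gamma^{-1}$, we
also have

$$\frac{X_{n-k+1,n}}{a\left(\frac{n}{k}\right)} \xrightarrow{P} \frac{1}{\gamma}.$$
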
We illustrate the result that the $k$-th largest order statistic leads to an asymptotically
consistent estimator for the extreme value index. In Figure 3.1, we show
$X_{n-k+1,n}/a(n/k)$ as a function of $n$, with $k = \lfloor n^{0.25} \rfloor$ (solid
line), $k = \lfloor n^{0.5} \rfloor$ (broken line) and $k = \lfloor n^{2/3} \rfloor$ (broken-dotted line) for data
generated from the Burr(1, 1, 1) distribution.


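(ii) Getting a closer look at what happens around $n/k$, we note that the fundamental
relation $(C_\gamma)$ suggests that as $k \to \infty$

$$\frac{U\left(\frac{n}{k}\left(1 + \frac{v}{\sqrt{k}}\right)^{-1}\right) - U\left(\frac{n}{k}\right)}{a\left(\frac{n}{k}\right)} \sim -\frac{v}{\sqrt{k}}.$$

Under appropriate continuity conditions, we expect that $a_n = k^{-1/2} a(n/k)$ is a
good choice since then $r_n(v) \to -v$. Since the normal distribution is symmetric,
this choice leads for the limiting operation $k/n \to 0$ to

$$\sqrt{k}\, \frac{X_{n-k+1,n} - U\left(\frac{n}{k}\right)}{a\left(\frac{n}{k}\right)} \xrightarrow{D} N(0, 1).$$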
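
The normal limit in (ii) can be explored by simulation. The sketch below assumes strict Pareto data, for which $U(x) = x^{\gamma}$ and $a(x) = \gamma x^{\gamma}$, and uses the standard fact that $1 - F(X_{n-k+1,n})$ follows a Beta$(k, n-k+1)$ distribution, so that a single beta draw replaces sorting a whole sample. The values of $n$, $k$, $\gamma$, the seed and the number of replications are illustrative choices.

```python
import math
import random
import statistics

def normalized_order_stat(n, k, gamma, rng):
    """sqrt(k) * (X_{n-k+1,n} - U(n/k)) / a(n/k) for strict Pareto data."""
    b = rng.betavariate(k, n - k + 1)   # distributed as 1 - F(X_{n-k+1,n})
    ratio = (k / (n * b)) ** gamma      # X_{n-k+1,n} / U(n/k), since U(x) = x**gamma
    return math.sqrt(k) * (ratio - 1.0) / gamma

rng = random.Random(2024)
draws = [normalized_order_stat(10 ** 6, 1000, 0.5, rng) for _ in range(4000)]
mean = statistics.fmean(draws)
spread = statistics.stdev(draws)
print(mean, spread)  # should be near 0 and 1 respectively
```

With $k/n = 10^{-3}$ the simulated mean and standard deviation should sit close to those of a standard normal variable; a small positive bias of order $k^{-1/2}$ in the mean is to be expected.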

Figure 3.1 Plot of $X_{n-k+1,n}/a(n/k)$ versus $n$ with $k = \lfloor n^{0.25} \rfloor$ (solid line),
$k = \lfloor n^{0.5} \rfloor$ (broken line) and $k = \lfloor n^{2/3} \rfloor$ (broken-dotted line) for data generated from
the Burr(1, 1, 1) distribution.


(iii) Take $\gamma > 0$. This last result is interesting since it suggests that we might be
able to construct asymptotic confidence intervals for the unknown index $\gamma$ based
on this normal approximation. To try this, write the above result in the form

$$\sqrt{k}\left(\frac{X_{n-k+1,n}}{a\left(\frac{n}{k}\right)} - \frac{1}{\gamma}\right) - \sqrt{k}\left(\frac{U\left(\frac{n}{k}\right)}{a\left(\frac{n}{k}\right)} - \frac{1}{\gamma}\right) \xrightarrow{D} N(0, 1). \qquad (3.7)$$

If we hope to get normal convergence of the random quantity on the left, then the
second quantity needs to have a limit as well. In other words, if we want normal
convergence of the estimator on the left, we need second-order information on
how fast $U(x)/a(x)$ tends to $1/\gamma$ as $x \to \infty$. This speed of convergence will then
tell us at what speed $k$ is allowed to tend to $\infty$ with $n$. Let us give a simple
example from the Hall-class where $U(x) = Cx^{\gamma}(1 + Dx^{-\beta}(1 + o(1)))$ for some
$\beta > 0$ and real constants $C > 0$ and $D$. Then it easily follows that $a(x) = \gamma C x^{\gamma}$
and the condition asks the quantity $\sqrt{k}\, (k/n)^{\beta}$ to converge, or that $k \sim \text{const} \times n^{2\beta/(1+2\beta)}$.
We illustrate the convergence of $\sqrt{k}(X_{n-k+1,n} - U(n/k))/a(n/k)$ to
its normal limit using the Burr($\eta$, $\tau$, $\lambda$) distribution. This distribution belongs to
the Hall-class with $\gamma = (\lambda\tau)^{-1}$, $C = \eta^{1/\tau}$, $D = -1/\tau$ and $\beta = 1/\lambda$. In Figure 3.2,
we show the histogram of $\sqrt{k}(X_{n-k+1,n} - U(n/k))/a(n/k)$ with $k = \lfloor n^{2/3} \rfloor$ for
simulated data sets of size (a) $n = 100$, (b) $n = 500$, (c) $n = 1000$ and (d) $n = 2000$
from the Burr(1, 1, 1) distribution.

Figure 3.2 Histogram of $\sqrt{k}(X_{n-k+1,n} - U(n/k))/a(n/k)$ with $k = \lfloor n^{2/3} \rfloor$ for
simulated data from the Burr(1, 1, 1) distribution: (a) $n = 100$, (b) $n = 500$,
(c) $n = 1000$ and (d) $n = 2000$.

(iv) Our last choice has nothing to do with extremal behaviour. Nevertheless, we
include it to show how the basic condition $(C_\gamma)$ for the extremal behaviour is
replaced by a continuity condition on the tail quantile function. In the central
part of the sample, the behaviour of $X_{n-k+1,n}$ is well approximated by a normal
distribution as can be shown as follows. Assume that $k$ tends to $\infty$ in such a way
that the ratio $k/n \to \lambda \in (0, 1)$. Since now the argument in $U(n/k) = b_n$ no
longer tends to $\infty$, the condition on $F$ or $U$ refers to finite values of the argument.
This time, we use the continuity of $U$. For, one can then show that if $F$ has a
density $f$ that is continuous and strictly positive at $U(\lambda^{-1})$, then

$$\sqrt{n}\, \frac{f(U(1/\lambda))}{\sqrt{\lambda(1 - \lambda)}} \left(X_{n-k+1,n} - U\left(\frac{n}{k}\right)\right) \xrightarrow{D} N(0, 1)$$

or

$$\sqrt{n}\left(X_{n-k+1,n} - U\left(\frac{n}{k}\right)\right) \xrightarrow{D} N\left(0, \frac{\lambda(1 - \lambda)}{f^2(U(1/\lambda))}\right). \qquad (3.8)$$

This last result can be fruitfully applied if one needs to estimate central quantiles.
However, note that there is hardly any relationship between the behaviour of the
largest order statistics and quantiles away from the tails.

What did we learn from the above analysis? That we are entitled to use more of 
the larger order statistics, that asymptotic consistency is not too difficult to obtain, 
but that the bias as it appears in (3.7) will only be controllable if we have some 
additional second-order information on the tail of U. Alternatively, the choice of 
the number of useful larger order statistics will also be decided by the second- 
order behaviour of the functions F and U. In the next step, we will use more order 
statistics, but the same phenomena will play a fundamental role. 


3.3 Second-order Theory 

This section covers second-order results on the condition $(C_\gamma)$. The discussion of
their equivalent versions for distributions is given in section 3.3.3.


3.3.1 Remainder in terms of U 

Assume that $F$ satisfies $C_\gamma(a)$ where $a \in \mathcal{R}_\gamma$. In this section, we derive a remainder
of the limiting operation expressed by the $(C_\gamma)$-condition. Let there exist a second
(ultimately) positive function $a_2(x) \to 0$ when $x \to \infty$ such that

$$\frac{U(ux) - U(x)}{a(x)} - h_\gamma(u) \sim a_2(x)k(u), \quad x \uparrow \infty \qquad (3.9)$$

for all $u > 0$. We know from the previous section that for some s.v. $\ell$, $a(x) = x^{\gamma}\ell(x)$.
Moreover, with this real value of $\gamma$ we also have that $h_\gamma(u) = \int_1^{u} v^{\gamma - 1}\, dv$.
Therefore, for $u, v > 0$, we have the relations

$$h_\gamma(uv) = u^{\gamma}h_\gamma(v) + h_\gamma(u) = v^{\gamma}h_\gamma(u) + h_\gamma(v), \qquad (3.10)$$

$$u^{\gamma}h_\gamma\left(\frac{1}{u}\right) = -h_\gamma(u), \qquad (3.11)$$

and

$$u^{\gamma}h_{-\gamma}(u) = h_\gamma(u). \qquad (3.12)$$
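
The algebraic relations for $h_\gamma$ are easy to confirm numerically. The sketch below implements $h_\gamma(u) = \int_1^u v^{\gamma-1}\,dv$ in closed form, with the logarithm for $\gamma = 0$, and evaluates both sides of the addition formula (3.10), of the reflection identity $u^{\gamma}h_\gamma(1/u) = -h_\gamma(u)$ and of (3.12) at a few arbitrary points. The function name `h` and the test points are illustrative.

```python
import math

def h(gamma, u):
    """h_gamma(u) = integral from 1 to u of v**(gamma - 1) dv."""
    return math.log(u) if gamma == 0 else (u ** gamma - 1.0) / gamma

def max_identity_gap(gamma, u, v):
    """Largest absolute discrepancy among the three identities at the point (u, v)."""
    gaps = [
        h(gamma, u * v) - (u ** gamma * h(gamma, v) + h(gamma, u)),   # (3.10)
        u ** gamma * h(gamma, 1.0 / u) + h(gamma, u),                 # reflection identity
        u ** gamma * h(-gamma, u) - h(gamma, u),                      # (3.12)
    ]
    return max(abs(g) for g in gaps)

for gamma in (-0.7, 0.0, 0.4, 2.0):
    print(gamma, max_identity_gap(gamma, 1.8, 3.5))
```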


We first derive an equation for $k$ in (3.9). To get the latter, we follow the usual
approach of replacing $u$ in the equation by $uv$ and subsequent rewriting. We get

$$U(uvx) - U(x) = \{U(uvx) - U(ux)\} + \{U(ux) - U(x)\}.$$

A somewhat tedious calculation yields the fundamental equation

$$a_2^{-1}(x)\left\{\frac{U(uvx) - U(x)}{a(x)} - h_\gamma(uv)\right\} = a_2^{-1}(ux)\left\{\frac{U(uvx) - U(ux)}{a(ux)} - h_\gamma(v)\right\} \frac{a(ux)}{a(x)}\, \frac{a_2(ux)}{a_2(x)} + a_2^{-1}(x)\left\{\frac{U(ux) - U(x)}{a(x)} - h_\gamma(u)\right\} + a_2^{-1}(x)\left\{\frac{a(ux)}{a(x)}h_\gamma(v) + h_\gamma(u) - h_\gamma(uv)\right\}.$$

The third term on the right can be simplified by using (3.10) and the fact that
$a$ is regularly varying with index $\gamma$. It yields

$$a_2^{-1}(x)h_\gamma(v)\left\{\frac{a(ux)}{a(x)} - u^{\gamma}\right\} = u^{\gamma}a_2^{-1}(x)h_\gamma(v)\left\{\frac{\ell(ux)}{\ell(x)} - 1\right\}.$$

The above equation leads to an equation for $k$ when $x \to \infty$ since from (3.9) (in
case $k$ is not a multiple of $h_\gamma$) we have either the convergence of $a_2(ux)/a_2(x)$
or, equivalently, the convergence of $a_2^{-1}(x)\left\{\frac{\ell(ux)}{\ell(x)} - 1\right\}$ for all $u > 0$.
The first implies regular variation of $a_2$. So, we assume from now on that

$$a_2 \in \mathcal{R}_\rho, \quad \rho \le 0. \qquad (3.13)$$

The alternative condition on $\ell$ can be written in the form

$$\frac{\ell(ux) - \ell(x)}{a_2(x)\ell(x)} \to m(u). \qquad (3.14)$$

However, this is precisely a limiting relation as discussed in section 2.1. Clearly, the
auxiliary function $a_2(x)\ell(x)$ is of $\rho$-regular variation and therefore $m(u) = c\,h_\rho(u)$
for some constant $c$. This result is in accordance with what we know from the
theory of slow variation with remainder as it has been developed by Goldie and
Smith (1987). Alternatively, we can say that $\log \ell$ satisfies $C_\rho(a_2)$.

Using this additional information in the fundamental equation, we arrive at the
following functional equation for the function $k$

$$k(uv) = u^{\gamma + \rho}k(v) + k(u) + c\,u^{\gamma}h_\gamma(v)h_\rho(u) \qquad (3.15)$$

which is valid for all $u, v > 0$. The derivation of the solution $k$ in (3.15) is given in
the last section of this chapter. Of course, one could consider a number of subcases
resulting from the possible values of both $\gamma$ and $\rho$. The derivation in section 3.4.3
shows this to be unnecessary. The next result is basically due to de Haan and
Stadtmüller (1996).

Theorem 3.1 Assume $U$ satisfies $C_\gamma(a)$; assume further that for all $u > 0$

$$\frac{U(ux) - U(x)}{a(x)} - h_\gamma(u) \sim a_2(x)k(u), \quad x \uparrow \infty$$

where $a_2 \to 0$ belongs to $\mathcal{R}_\rho$ with $\rho \le 0$. Then for some real constant $c$

$$\frac{1}{a_2(x)}\left\{\frac{a(ux)}{a(x)} - u^{\gamma}\right\} \to c\,u^{\gamma}h_\rho(u).$$

Furthermore, for this constant $c$ and an arbitrary constant $A \in \mathbb{R}$,

$$k(u) = A\,h_{\gamma + \rho}(u) + c\int_1^{u} t^{\gamma - 1}h_\rho(t)\, dt.$$

If $\rho < 0$, then an appropriate choice of the auxiliary function $a$ results in a
simplification of the limit function.

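Proposition 3.2 Under the conditions of Theorem 3.1, if $\rho < 0$, then

$$\lim_{x \to \infty} \frac{1}{a_2(x)}\left\{\frac{U(ux) - U(x)}{\tilde{a}(x)} - h_\gamma(u)\right\} = \tilde{c}\,h_{\gamma + \rho}(u),$$

where $\tilde{c} = \rho^{-1}c + A$ and $\tilde{a}(x) = a(x)\left(1 - \rho^{-1}c\,a_2(x)\right)$ for all $x$ such that $a_2(x) < |\rho/c|$.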

3.3.2 Examples 

We give a couple of distributions that often appear in applications and for which 
the second-order quantities can be easily derived. 


Weibull distributions 


A distribution that is often employed in reliability theory is the Weibull distribution
defined on $\mathbb{R}_+$ by the tail expression

$$1 - F(x) = \exp\{-Cx^{\beta}\}$$

where both $C$ and $\beta$ are positive. This distribution is to be distinguished from
the extreme value Weibull type discussed in Chapter 2. For $\beta = 1$, one finds the
exponential distribution while for $\beta = 2$, one gets the Rayleigh distribution. The
tail quantile function is easily found and equals $U(y) = (C^{-1}\log y)^{1/\beta}$. It then
easily follows that $a(x) = \beta^{-1}C^{-1/\beta}(\log x)^{\beta^{-1} - 1}$. Hence, $F$ satisfies $C_0(a)$.

Moreover

$$\frac{U(xu) - U(x)}{a(x)} - \log u = \beta \log x\left\{\left(1 + \frac{\log u}{\log x}\right)^{1/\beta} - 1\right\} - \log u \sim \frac{1 - \beta}{2\beta}\, \frac{(\log u)^2}{\log x}$$

so that $\rho = 0$, $a_2(x) = (\log x)^{-1}$, $c = \frac{1 - \beta}{\beta}$ and $k(u) = \frac{1 - \beta}{2\beta}(\log u)^2$. The case where
$\beta = 1$ is particularly simple as it should be.
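
The second-order expansion for the Weibull distribution can be checked numerically by working directly with $\log x$, which avoids overflow for very large $x$. The sketch assumes the auxiliary function $a(x) = \beta^{-1}C^{-1/\beta}(\log x)^{1/\beta - 1}$, for which the constant $C$ cancels from the normalized difference, and compares the remainder multiplied by $\log x$ with $\frac{1-\beta}{2\beta}(\log u)^2$. Function names and test values are illustrative.

```python
import math

def scaled_remainder(log_x, log_u, beta):
    """log(x) * [ (U(xu) - U(x)) / a(x) - log(u) ] for the Weibull tail quantile function."""
    # (U(xu) - U(x)) / a(x) = beta * log(x) * ((1 + log(u)/log(x))**(1/beta) - 1)
    first_order = beta * log_x * math.expm1(math.log1p(log_u / log_x) / beta)
    return log_x * (first_order - log_u)

def k_limit(log_u, beta):
    """Second-order limit function (1 - beta) / (2 * beta) * (log u)**2."""
    return (1.0 - beta) / (2.0 * beta) * log_u ** 2

for beta in (0.5, 1.0, 2.0):
    print(beta, scaled_remainder(1e5, 1.5, beta), k_limit(1.5, beta))
```

For $\beta = 1$ both quantities vanish, in line with the remark that the exponential case is particularly simple.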
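
Hall-class

We have already hinted at the Hall-class of distributions that is defined in terms
of the tail quantile function by an expansion of the form

$$U(x) = Cx^{\gamma}\left(1 + \frac{D_1}{x^{\beta_1}} + \frac{D_2}{x^{\beta_2}} + \cdots\right)$$

where $C > 0$, $\gamma > 0$ while $0 < \beta_1 < \beta_2 < \cdots$. The calculations for the second-order
quantities are particularly simple and lead to $a(x) = \gamma C x^{\gamma}$ and therefore

$$\frac{U(xu) - U(x)}{a(x)} - h_\gamma(u) = D_1\, \frac{\gamma - \beta_1}{\gamma}\, x^{-\beta_1}h_{\gamma - \beta_1}(u) + \cdots$$

so that $\rho = -\beta_1$, $c = 0$, $a_2(x) = x^{-\beta_1}$ while $k(u) = D_1\, \frac{\gamma - \beta_1}{\gamma}\, h_{\gamma - \beta_1}(u)$.
As a special example, we look at the Burr($\eta$, $\tau$, $\lambda$) distribution where

$$1 - F(x) = \left(\frac{\eta}{\eta + x^{\tau}}\right)^{\lambda}.$$

With $C = \eta^{1/\tau}$, $\gamma = (\lambda\tau)^{-1}$, $D_1 = -\tau^{-1}$ and $\beta_1 = \lambda^{-1}$ this distribution turns out to
be a special case of the general Hall-class.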

3.3.3 Remainder in terms of F 

We transform the remainder condition that has been stated above in terms of U 
towards a statement based on the distribution function F. This can sometimes be 
useful in verifying such remainder condition on specific examples. Also, in the 
subsequent statistical chapters, these will be of use. 

Condition (3.9) can be written in the form

$$U(ux) = U(x) + a(x)h_\gamma(u) + a_1(x)k_x(u)$$

where we put $a_1(x) = a(x)a_2(x)$ and where $k_x(u) \to k(u)$ as $x \to \infty$.

We operate on the above equation with the function $1 - F$. By continuity, the
left-hand side turns into $(ux)^{-1}$. We can also replace $x^{-1}$ by $(1 - F)(U(x))$. We
then obtain

$$\frac{1}{u} = \frac{(1 - F)\left(U(x) + a(x)h_\gamma(u) + a_1(x)k_x(u)\right)}{(1 - F)(U(x))}.$$

We now define $y = U(x)$ and replace $a(x)$ by $h(U(x))$. The above relation turns
into the form

$$\frac{1}{u} = \frac{(1 - F)\left(y + h(y)h_\gamma(u) + h(y)\chi(y)K_y(u)\right)}{(1 - F)(y)}$$

where we have defined

$$\chi(y) := a_2(U^{\leftarrow}(y)) \qquad (3.16)$$

and

$$K_y(u) := k_{U^{\leftarrow}(y)}(u).$$

A more transparent form of the above equation is the following

$$\frac{1}{u} = \frac{(1 - F)\left(y + h(y)v(u, y)\right)}{(1 - F)(y)} \qquad (3.17)$$

where

$$v(u, y) := h_\gamma(u) + \chi(y)K_y(u). \qquad (3.18)$$

The solution of this equation is given in the final section and leads to the following
result. Here again, we use the notation $\eta_\gamma(v) = (1 + \gamma v)^{-1/\gamma}$.

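Theorem 3.3 Assume $U$ satisfies $C_\gamma(a)$ and is continuous. Assume further that for
all $u > 0$

$$\frac{U(ux) - U(x)}{a(x)} - h_\gamma(u) \sim a_2(x)k(u), \qquad x \uparrow \infty,$$

where $a_2(x) \to 0$ belongs to $\mathcal{R}_\rho$ with $\rho \le 0$. Then for $y \uparrow x^*$

$$\frac{1 - F(y + vh(y))}{1 - F(y)} - \eta_\gamma(v) \sim \chi(y)\,\psi(v)$$

where $\chi(y) = a_2(U^{\leftarrow}(y))$ and $\psi(v) = \eta_\gamma^{1+\gamma}(v)\,k\!\left(\frac{1}{\eta_\gamma(v)}\right)$.

The converse holds too since all steps can be reversed. Similarly to Proposition 3.2,
also here a simpler limit function can be obtained when replacing $a$ by $\tilde{a}$.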


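Proposition 3.4 Under the assumptions of Theorem 3.3, we have for $\rho < 0$

$$\lim_{y \uparrow x^*} \frac{1}{\chi(y)} \left\{ \frac{1 - F(y + v\tilde{h}(y))}{1 - F(y)} - \eta_\gamma(v) \right\} = \eta_\gamma^{1+\gamma}(v)\,h_{\gamma+\rho}\!\left(\frac{1}{\eta_\gamma(v)}\right)$$

where $\tilde{h}(y) = \tilde{a}(U^{\leftarrow}(y))$.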


3.4 Mathematical Derivations 

In this section, we give sketches of the proofs of the results in the previous sections. 
We first prove the general auxiliary result (3.6) without reference to properties of 
the underlying distribution or tail quantile function. Then we indicate why (3.8) is 
valid. We then turn to the results concerning second-order theory.

3.4.1 Proof of (3.6)

We need to rewrite the integrand in such a way that both factors $F^{n-k}$ and
$(1 - F)^{k-1}$ can be handled simultaneously. To achieve that, substitute $1 - F(x) := q_n + p_n v$
where the sequences $q_n$ and $p_n$ will be determined in due time. We also
write $\bar{q}_n := 1 - q_n$ for convenience. With this substitution, we can write

$$Z_n = \frac{n!\,p_n}{(k-1)!\,(n-k)!} \int_{-q_n/p_n}^{\bar{q}_n/p_n} z(\tau_n(v))\,(q_n + p_n v)^{k-1}(\bar{q}_n - p_n v)^{n-k}\,dv = I_1(q_n, p_n) \int_{-q_n/p_n}^{\bar{q}_n/p_n} z(\tau_n(v))\,I_2(q_n, p_n, v)\,dv$$

where we have abbreviated

$$\tau_n(v) := \frac{U\!\left(\frac{1}{q_n + p_n v}\right) - b_n}{a_n}$$

for convenience. Furthermore, we put

$$I_2(q_n, p_n, v) := \left(1 + \frac{p_n v}{q_n}\right)^{k-1}\left(1 - \frac{p_n v}{\bar{q}_n}\right)^{n-k}$$

and

$$I_1(q_n, p_n) := \frac{n!\,p_n}{(k-1)!\,(n-k)!}\,q_n^{k-1}\,\bar{q}_n^{\,n-k}.$$

We first deal with $I_2$. We take its logarithm and expand to get

$$\log I_2(q_n, p_n, v) = p_n v \left(\frac{k-1}{q_n} - \frac{n-k}{\bar{q}_n}\right) - \frac{p_n^2 v^2}{2}\left(\frac{k-1}{q_n^2} + \frac{n-k}{\bar{q}_n^2}\right) + \cdots$$

It is now natural to choose $q_n$ in such a way that the first term on the right
annihilates. The choice of $p_n$ can then be made to make the coefficient of $-\frac{1}{2}v^2$ equal
to 1. An exact solution would lead to $q_n = \frac{k-1}{n-1}$ and $p_n^2 = \frac{(k-1)(n-k)}{(n-1)^3}$. However, the
first term still annihilates asymptotically if we stick to the simpler choice

$$q_n = \frac{k}{n}, \qquad p_n^2 = k(n - k)n^{-3}.$$

With this choice, one can then prove that indeed $I_2(q_n, p_n, v) \to e^{-v^2/2}$. With the
help of Stirling's formula $m! \sim m^m e^{-m}\sqrt{2\pi m}$, it takes a bit of simple calculus
to verify that also $I_1(q_n, p_n) \to (2\pi)^{-1/2}$.

The reason for the above calculations is to show that, once the quantities $q_n$ and
$p_n$ are chosen as above, all information on the underlying distribution is contained
in the remaining quantity $\tau_n(v)$.

It remains to show that the expression (3.6) is valid. To do that, let $\varepsilon$ be any
positive small quantity. Choose $T$ large enough to have that $\Phi(-T) = 1 - \Phi(T) < \varepsilon$.
That quantity $T$ will need to satisfy some more conditions. If $M$ is a bound for
the bounded function $|z|$, then

$$\left| \int_T^{\bar{q}_n/p_n} I_1(q_n, p_n)\,z(\tau_n(v))\,I_2(q_n, p_n, v)\,dv \right| \le M \int_T^{\bar{q}_n/p_n} I_1(q_n, p_n)\,I_2(q_n, p_n, v)\,dv.$$

Turning back to the expression with $1 - F(x) = q_n + p_n v =: s$ as integrating
variable, the integral on the right can be rewritten as

$$J_1 := \int_T^{\bar{q}_n/p_n} I_1(q_n, p_n)\,I_2(q_n, p_n, v)\,dv = \frac{n!}{(k-1)!\,(n-k)!} \int_{q_n + p_n T}^{1} s^{k-1}(1 - s)^{n-k}\,ds.$$

However, on the interval $(q_n + p_n T, 1)$, the quantity $((s - q_n)/(p_n T))^2$ is not
smaller than 1 and hence

$$J_1 \le \frac{n!}{(k-1)!\,(n-k)!\,p_n^2 T^2} \int_0^1 (s - q_n)^2 (1 - s)^{n-k} s^{k-1}\,ds = \frac{1}{p_n^2 T^2}\left\{ \frac{k(k+1)}{(n+1)(n+2)} - \frac{2kq_n}{n+1} + q_n^2 \right\} = O(T^{-2})$$

which can be made smaller than $\varepsilon$ for $T$ large enough as long as $\limsup k/n < 1$. A
similar argument holds for the other limit.

Combining the above estimates, we can write

$$\left| Z_n - \frac{1}{\sqrt{2\pi}} \int z(\tau_n(v))\,e^{-\frac{1}{2}v^2}\,dv \right| \le M \int_{-T}^{T} \left| I_1(q_n, p_n)\,I_2(q_n, p_n, v) - \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}v^2} \right| dv + 4M\varepsilon.$$

This proves the important auxiliary result.


3.4.2 Proof of (3.8) 

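Remember that we take $k/n \to \lambda \in (0, 1)$. We return to the crucial quantity $\tau_n(v)$.
Rewrite this in the form

$$\tau_n(v) = \frac{1}{a_n}\left\{ U\!\left( \left[ \frac{k}{n} + \frac{v}{\sqrt{n}} \sqrt{\frac{k}{n}\left(1 - \frac{k}{n}\right)} \right]^{-1} \right) - U\!\left(\frac{n}{k}\right) \right\} + \frac{U(n/k) - b_n}{a_n}.$$

Part of this quantity is of the order $n^{-1/2}$ when the tail quantile function
has a derivative at the point $1/\lambda$. If we choose $\sqrt{n}\,a_n = 1$ and $b_n = U(n/k)$, then under the latter
condition

$$\tau_n(v) \to \tau(v).$$

Then $\tau(v) = -C_\lambda v$ for some function $C_\lambda > 0$. It follows that for continuous,
bounded functions $z$,

$$E\left\{ z\!\left( \sqrt{n}\left[ X_{n-k+1,n} - U\!\left(\frac{n}{k}\right) \right] \right) \right\} \to \frac{1}{\sqrt{2\pi}\,C_\lambda} \int_{-\infty}^{\infty} z(y)\,e^{-\frac{y^2}{2C_\lambda^2}}\,dy.$$

Let us formulate the above result in terms of $F$.

Assume that $k/n \to \lambda \in (0, 1)$ and that $F$ has a density $f$ that is continuous and
strictly positive at $U(\lambda^{-1})$. Then

$$\sqrt{n}\;\frac{f(U(1/\lambda))}{\sqrt{\lambda(1 - \lambda)}}\left( X_{n-k+1,n} - U\!\left(\frac{n}{k}\right) \right) \stackrel{D}{\to} N(0, 1)$$

which is (3.8) as we will now show. We have already shown that
$\sqrt{n}\,(X_{n-k+1,n} - U(n/k)) \stackrel{D}{\to} N(0, C_\lambda^2)$. However, since $F$ has a derivative $f$, by the definition of
the inverse function, the relation $(1 - F)(U(y)) = y^{-1}$ is the same as in (2.17). Differentiating it
gives $f(U(y))\,U'(y) = y^{-2}$. But then

$$C_\lambda = \lambda^{-2}\,U'(1/\lambda)\,\sqrt{\lambda(1 - \lambda)} = \frac{\sqrt{\lambda(1 - \lambda)}}{f(U(1/\lambda))}$$

which is equivalent to the statement (3.8). Observe that the two boundary cases
$\lambda = 0$ and $\lambda = 1$ had to be excluded.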


3.4.3 Solution of (3.15) 

We follow the approach by Vanroelen (2003). Define

$$W(x) := U(x) - \int_1^x \frac{a(u)}{u}\,du.$$

Then it is easy to derive the following auxiliary equation for the function $W$:

$$\frac{W(xu) - W(x)}{a_2(x)a(x)} = \frac{1}{a_2(x)}\left[ \frac{U(xu) - U(x)}{a(x)} - h_\gamma(u) \right] - \frac{1}{a_2(x)} \int_1^u \left[ \frac{a(xt)}{a(x)} - t^{\gamma} \right] \frac{dt}{t}.$$

The first part of the right-hand side converges to $k(u)$ in view of our second-order
condition on the function $U$. By the condition on the auxiliary function $a_2$, also
the second part of the right-hand side converges. But then the left-hand side of
the above expression converges as well. Automatically, the limit has to be of the
form as predicted in section 2.1. The auxiliary function on the left is of regular
variation with index $\rho + \gamma$ and hence the limit on the left is of the form $A\,h_{\rho+\gamma}(u)$
for some constant $A$. Solving for $k(u)$ gives the requested expression.

3.4.4 Solution of (3.18)

The equation has to be solved for the quantity $u$. Two cases arise, depending on
the value of $\gamma$.

(i) $\gamma \ne 0$: Substitute $h_\gamma(u) = \frac{1}{\gamma}(u^{\gamma} - 1)$. Then

$$u^{\gamma} = (1 + \gamma v)\left( 1 - \frac{\gamma\chi(y)K_y(u)}{1 + \gamma v} \right).$$

Taking the $\gamma$-th root, we can also write this in the form

$$\frac{1}{u} = \eta_\gamma(v)\left( 1 - \frac{\gamma\chi(y)K_y(u)}{1 + \gamma v} \right)^{-1/\gamma}.$$

But then we arrive at the following expression for the remainder condition

$$\frac{1 - F(y + vh(y))}{1 - F(y)} - \eta_\gamma(v) = \eta_\gamma(v)\left\{ \left( 1 - \frac{\gamma\chi(y)K_y(u)}{1 + \gamma v} \right)^{-1/\gamma} - 1 \right\}.$$

Divide both sides by $\chi(y) \to 0$ when $y \uparrow x^*$. Apply the usual approximation
on the right-hand side to see that this side is asymptotically equal to
$\eta_\gamma(v)(1 + \gamma v)^{-1}K_y(u(v))$. However, when $y \uparrow x^*$ also $u(v, y) \to 1/\eta_\gamma(v)$ so
that by the continuity of $F$ and the condition $k_x(u) \to k(u)$ we ultimately
find

$$\frac{1}{\chi(y)}\left\{ \frac{1 - F(y + vh(y))}{1 - F(y)} - \eta_\gamma(v) \right\} \to \eta_\gamma^{1+\gamma}(v)\,k\!\left(\frac{1}{\eta_\gamma(v)}\right) =: \psi(v).$$

(ii) $\gamma = 0$: A similar approach applies with the equation

$$\log u + \chi(y)K_y(u) = v(u, y)$$

that has to be solved for $u$. Following the same path as before, the limit
quantity $\psi(v)$ is now equal to $e^{-v}k(e^{v})$, which coincides with the previous
limit if we straightforwardly put $\gamma = 0$.



4 

TAIL ESTIMATION UNDER 
PARETO-TYPE MODELS 


In this chapter, we consider the estimation of the extreme value index, of extreme
quantiles and of small exceedance probabilities, in case the distribution is of Pareto-
type, that is,

$$1 - F(x) = x^{-1/\gamma}\,\ell_F(x),$$

or equivalently

$$Q(1 - 1/x) = U(x) = x^{\gamma}\,\ell_U(x),$$

where $\ell_F$ and $\ell_U$ are related slowly varying functions as shown in section 2.9.3. We also
discuss inferential matters such as point estimators and confidence intervals.

Since the early eighties of the twentieth century, this problem has been studied in
great detail in the literature. Hill's estimator (Hill (1975)), which appeared in 1975,
continues to be of great importance and constitutes the main subject of this chapter.
However, to get a better feeling for the choice of possible estimators, we start out
with a few examples of naive estimators. What they all have in common is an attempt
to avoid the unknown and irrelevant slowly varying part $\ell_U$. We assume from now on
that we have a sample of i.i.d. values $\{X_i;\ 1 \le i \le n\}$ from a Pareto-type tail $1 - F$.

Pareto-type tails are systematically used in certain branches of non-life insurance.
Also in finance (stock returns) and in telecommunication (file sizes, waiting
times), this class is appropriate. In other areas of application of extreme value
statistics such as hydrology, the use of Pareto models appears to be much less
systematic. However, the estimation problems considered here are typical for extreme
value methodology and at the same time the Pareto-type model is more specific and
simpler to handle. So this chapter also has an instructive purpose; the heavy-tailed
distributions are an ideal 'playground' for developing effective methods that are to
be extended in the general case $\gamma \in \mathbb{R}$.


4.1 A Naive Approach 

Let us try some easy ways to get rid of the function $\ell_U$. From Proposition 2.4, we
see that for $x \to \infty$,

$$\log U(x) = \gamma \log x + \log \ell_U(x) \sim \gamma \log x.$$

Hence, it looks natural to replace in the above expression the deterministic quantity
$U$ by a random quantity whose argument goes to infinity with the sample size. For
simplicity, the argument $x$ could be taken to be $n$ or more generally $n/k$. In the
sequel, we set $\hat{U}_n(x) = \hat{Q}_n(1 - 1/x)$ for the natural empirical estimator of $U$. We
then expect to have a probabilistic statement of the type

$$\log \hat{U}_n\!\left(\frac{n}{k}\right) \sim \gamma \log \frac{n}{k}.$$

However, for any $r \in \{1, 2, \ldots, n\}$, one has

$$\hat{U}_n\!\left(\frac{n}{r}\right) = X_{n-r+1,n}$$

and so we expect asymptotically that for $n \to \infty$, $\log X_{n-k+1,n} \sim \gamma \log(n/k)$.
From (3.3), it follows that, replacing $a(n)$ by $\gamma U(n)$, when $k$ is kept fixed,

$$\log\left( \frac{X_{n-k+1,n}}{U(n)} \right) = O_P(1)$$

or

$$\log X_{n-k+1,n} - \gamma \log n - \log \ell_U(n) = O_P(1)$$

from which, with Proposition 2.4, one indeed derives that if $F$ satisfies $(C_\gamma)$ and
$\gamma > 0$

$$\log X_{n-k+1,n} / \log n \stackrel{P}{\to} \gamma.$$

This simple result shows that a single large order statistic can be used to
estimate the extreme value index $\gamma$. But there are some serious drawbacks to this
naive approach. For example, it looks indeed unsatisfactory to use only one single
order statistic in the estimation procedure. Also, what does it mean to keep $k$
fixed? Moreover, from the derivation it follows that the rate of convergence is
logarithmically slow.

By basic statistical intuition, we can foresee that an estimator based on more
order statistics will be more reliable. One possibility is to consider differences of
two different extreme order statistics, such as (see Bacro and Brito (1993)):

$$\frac{\log X_{n-k+1,n} - \log X_{n-2k+1,n}}{\log 2}$$

or generalizations with spacings of different order than $k$. Using the regular
variation of Pareto-type tails, it can easily be seen that this estimator is consistent
if $k \to \infty$ and $n/k \to \infty$. It turns out that this statistic improves the consistency
rate considerably with respect to the first naive estimator, but still uses only two
extreme observations. Hill's estimator will improve on this aspect considerably.
But even then, we need to know what large order statistics can be used in the
procedure. From the derivations in the previous chapter, we could deduce that, if
the sample size tends to $\infty$, then also $k$ should be allowed to do the same, albeit
at a certain rate.
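The two naive estimators of this section can be tried out on simulated data. The Python sketch below is only an illustration under stated assumptions: it uses a strict Pareto sample with survival function $x^{-1/\gamma}$, and the function names, the seed and the values of $n$ and $k$ are hypothetical choices, not taken from the text.

```python
import math
import random


def strict_pareto_sample(n, gamma, rng):
    """n i.i.d. draws with survival function x**(-1/gamma), x > 1."""
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [(1.0 - rng.random()) ** (-gamma) for _ in range(n)]


def naive_estimator(sample, k):
    """log X_{n-k+1,n} / log n, based on a single upper order statistic."""
    xs = sorted(sample)
    n = len(xs)
    return math.log(xs[n - k]) / math.log(n)


def two_order_statistics_estimator(sample, k):
    """(log X_{n-k+1,n} - log X_{n-2k+1,n}) / log 2, requires 2k <= n."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= 2 * k <= n:
        raise ValueError("need 1 <= 2k <= n")
    return (math.log(xs[n - k]) - math.log(xs[n - 2 * k])) / math.log(2.0)


if __name__ == "__main__":
    rng = random.Random(2024)
    data = strict_pareto_sample(20000, 0.5, rng)
    print("single order statistic:", naive_estimator(data, 5))
    print("two order statistics:  ", two_order_statistics_estimator(data, 200))
```

In such experiments, the single-order-statistic estimate typically stays visibly away from the true $\gamma$ even for large $n$, which reflects the logarithmically slow rate mentioned above.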


4.2 The Hill Estimator 

There are at least four natural ways to introduce this estimator. All of them are 
inspired by the previous analysis. Moreover, the estimator enjoys a high degree of 
popularity thanks to some nice theoretical properties but in spite of some serious 
drawbacks. 


4.2.1 Construction 


(i) The quantile view. The first source of inspiration comes from the quantile 
plots of the Pareto-type distributions. 


(a) These distributions satisfy

$$\frac{\log Q(1 - p)}{-\log p} \to \gamma, \quad \text{as } p \to 0.$$

From this, it follows that a Pareto quantile plot, that is, an exponential
quantile plot based on the log-transformed data, is ultimately linear
with slope $\gamma$ near the largest observations.

(b) Moreover, the slope of an ultimately linear exponential quantile plot can
be estimated by the mean excess values of the type $E_{k,n}$ as discussed
in section 1.2.2.

Combining these two observations leads to the mean excess value of the
log-transformed data, known as the Hill estimator (Hill (1975)):

$$H_{k,n} = \frac{1}{k} \sum_{j=1}^{k} \log X_{n-j+1,n} - \log X_{n-k,n}. \qquad (4.1)$$

An important question can be raised concerning the optimality of the Hill
estimator as an estimator of the slope of a quantile-quantile (QQ) plot. In fact, the
vertical coordinates $\log X_{n-j+1,n}$ are not independent and do not possess the
same variance, and hence summarizing the upper part of the Pareto quantile
plot $\left(\log\frac{n+1}{j},\ \log X_{n-j+1,n}\right)$, $j = 1, \ldots, k$, using a least-squares line
$y = \log X_{n-k,n} + \gamma\left(x - \log((n+1)/(k+1))\right)$ does not seem to be efficient
since the classical Gauss-Markov conditions are not met.

Combining the information over a set of possible y-values, we can look for
the least-squares straight line that fits best to the points

$$\left\{ \left( \log\frac{n+1}{j},\ \log X_{n-j+1,n} \right);\ j = 1, \ldots, k+1 \right\}$$

where we force the straight line to pass through the leftmost of these points.
Such a line has the form

$$y = \log X_{n-k,n} + \gamma\left( x - \log\frac{n+1}{k+1} \right).$$

A little reflection indicates that it might be wise to give points on the right of
the above set variable weights in view of the above-mentioned problem of
heteroscedasticity. In order to find the least-squares value of $\gamma$, we therefore
minimize the quantity

$$\sum_{j=1}^{k} w_{j,k} \left( \log X_{n-j+1,n} - \left( \log X_{n-k,n} + \gamma \log\frac{k+1}{j} \right) \right)^2$$

where $\{w_{j,k};\ j = 1, \ldots, k\}$ are appropriate weights. A simple calculation
tells us that the resulting value of $\gamma$, say $\hat{\gamma}_k$, is given by

$$\hat{\gamma}_k = \sum_{j=1}^{k} \alpha_{j,k} \log\frac{X_{n-j+1,n}}{X_{n-k,n}}$$

where

$$\alpha_{j,k} = \frac{w_{j,k} \log\frac{k+1}{j}}{\sum_{i=1}^{k} w_{i,k} \left( \log\frac{k+1}{i} \right)^2}.$$

When choosing $\alpha_{j,k} = 1/k$ one arrives at the Hill estimator.

(ii) The probability view. The definition of a Pareto-type tail can be rewritten as

$$\frac{1 - F(tx)}{1 - F(t)} \to x^{-1/\gamma} \quad \text{as } t \to \infty \text{ for any } x > 1.$$

This can be interpreted as

$$P(X/t > x \mid X > t) \approx x^{-1/\gamma} \quad \text{for } t \text{ large},\ x > 1.$$

Hence, it appears a natural idea to associate a strict Pareto distribution with
survival function $x^{-1/\gamma}$ to the distribution of the relative excesses $Y_j = X_i/t$
over a high threshold $t$ conditionally on $X_i > t$, where $i$ is the index of the $j$-th
exceedance in the original sample and $j = 1, \ldots, N_t$. The log-likelihood
conditionally on $N_t$ then becomes

$$\log L(Y_1, \ldots, Y_{N_t}) = -N_t \log \gamma - \left(1 + \frac{1}{\gamma}\right) \sum_{j=1}^{N_t} \log Y_j.$$

Since

$$\frac{d \log L}{d\gamma} = -\frac{N_t}{\gamma} + \frac{1}{\gamma^2} \sum_{j=1}^{N_t} \log Y_j,$$

the likelihood equation leads to

$$\hat{\gamma} = \frac{1}{N_t} \sum_{j=1}^{N_t} \log Y_j.$$

Choosing for the threshold $t$ an upper order statistic $X_{n-k,n}$ (so that $N_t = k$),
we obtain Hill's estimator again. For $t$ non-random, we get the ratio estimator
of Goldie and Smith (1987).


(iii) Renyi's exponential representation. There is an alternative way of writing
the Hill estimator by introducing the random variables

$$Z_j := j\left( \log X_{n-j+1,n} - \log X_{n-j,n} \right) =: j\,T_j$$

that will play a crucial role later. Through a partial summation, one finds
that

$$\sum_{j=1}^{k} Z_j = \sum_{j=1}^{k} j\,T_j = \sum_{j=1}^{k} \sum_{i=1}^{j} T_j = \sum_{i=1}^{k} \sum_{j=i}^{k} T_j$$

which easily leads to the crucial relation

$$H_{k,n} = \frac{1}{k} \sum_{j=1}^{k} Z_j = \bar{Z}_k.$$

Thanks to a remarkable property about the order statistics of an exponential
distribution that was discovered by A. Renyi (see further in section 4.4), it
follows that in case of a strict Pareto distribution, the transformed variables
$Z_j$ are independent and exponentially distributed:

$$Z_j \stackrel{D}{=} \gamma E_j, \quad j = 1, \ldots, k,$$

with $(E_1, E_2, \ldots)$ being standard exponentially distributed. This exponential
representation can be interpreted as a generalized linear regression model
with $\bar{Z}_k$ the obvious maximum likelihood estimator of $\gamma$. Expressing a tail
index estimator in terms of spacings between subsequent order statistics
follows intuition: samples from distributions with heavy tails will be
characterized by systematically larger gaps when the index $j$ decreases.


(iv) Mean excess approach. Still another alternative derivation is based on the
mean excess function of the log-transformed data. If $1 - F \in \mathcal{R}_{-1/\gamma}$ with
$\gamma > 0$, then as derived in section 2.6

$$E\{\log X - \log x \mid X > x\} = \int_x^{\infty} \frac{1 - F(u)}{1 - F(x)}\,\frac{du}{u} \to \gamma \quad \text{as } x \to \infty.$$

Replace the distribution $F$ by its empirical counterpart $\hat{F}_n$, defined in section
1.1, and $x$ by the random sequence $X_{n-k,n}$ that tends to $\infty$. It is then a
pleasant exercise to show that

$$H_{k,n} = \int_{X_{n-k,n}}^{\infty} \frac{1 - \hat{F}_n(u)}{1 - \hat{F}_n(X_{n-k,n})}\,\frac{du}{u}.$$
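The definition (4.1) and the identity between the Hill estimator and the mean of the scaled log-spacings $Z_j$ can be sketched in a few lines of Python. The function names and the simulated strict Pareto sample are illustrative assumptions; the code only presumes that the $k+1$ largest observations are positive.

```python
import math
import random


def hill_estimator(sample, k):
    """H_{k,n} = (1/k) sum_{j=1}^k log X_{n-j+1,n} - log X_{n-k,n}."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("need 1 <= k < n")
    if xs[n - k - 1] <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    log_threshold = math.log(xs[n - k - 1])  # log X_{n-k,n}
    return sum(math.log(x) for x in xs[n - k:]) / k - log_threshold


def scaled_log_spacings(sample, k):
    """Z_j = j (log X_{n-j+1,n} - log X_{n-j,n}), j = 1, ..., k."""
    xs = sorted(sample)
    n = len(xs)
    return [j * (math.log(xs[n - j]) - math.log(xs[n - j - 1]))
            for j in range(1, k + 1)]


if __name__ == "__main__":
    rng = random.Random(7)
    # Strict Pareto sample with gamma = 0.5 (illustrative choice).
    data = [(1.0 - rng.random()) ** (-0.5) for _ in range(2000)]
    z = scaled_log_spacings(data, 200)
    print(hill_estimator(data, 200), sum(z) / len(z))
```

Both printed numbers agree up to rounding, which is the partial summation identity of view (iii).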

4.2.2 Properties 

Mason (1982) showed that $H_{k,n}$ is a consistent estimator for $\gamma$ (as $k, n \to \infty$,
$k/n \to 0$) whatever the slowly varying function $\ell_F$ (or $\ell_U$) may be. This is
even true for weakly dependent data (Hsing (1991)) or in case of a linear
process (Resnick and Starica (1995)). Asymptotic normality of $H_{k,n}$ was discussed
among others in Hall (1982), Davis and Resnick (1984), Csorgo and Mason (1985),
Haeusler and Teugels (1985), Deheuvels et al. (1988), Csorgo and Viharos (1998),
de Haan and Peng (1998) and de Haan and Resnick (1998). In Drees (1998)
and Beirlant et al. (2002a), variance and rate optimality of the Hill estimator was
derived for large submodels of the Pareto-type model.

However, several problems arise.


(i) For every choice of $k$, we obtain another estimator for $\gamma$. Usually one plots
the estimates $H_{k,n}$ against $k$, yielding the Hill plot: $\{(k, H_{k,n}) : 1 \le k \le n - 1\}$.
However, these plots typically are far from being constant, which
makes it difficult to use the estimator in practice without further guideline
on how to choose the value $k$. This is illustrated by a simulation
from a strict Pareto distribution (Figure 4.1(a)) and from a Burr distribution
(Figure 4.1(b)).

Resnick and Starica (1997) proposed to plot $\{(\log k, H_{k,n}) : 1 \le k \le n - 1\}$,
see also Drees et al. (2000). While this indeed focuses the graphics
on the appropriate area, this procedure does not overcome some of the other
problems that are cited next.

Figure 4.1 $(k, H_{k,n})$ for simulated datasets of size $n = 500$ from (a) the Pa(1)
and (b) the Burr(1,1,1) distribution.


(ii) In many instances, a severe bias can appear. This happens when the effect
of the slowly varying part in the model disappears slowly in the Pareto
quantile plot. Stated differently within the probability view, the assumption
that the relative excesses above a certain threshold follow a strict Pareto
distribution is sometimes too optimistic. This is illustrated in Figure 4.2
where a Pareto quantile plot from a Burr(1,1,1) is put in contrast with that of
a Burr(1,0.25,4) distribution. See also Figure 4.3 for a simulation experiment
from a $|T_4|$ distribution with $\gamma = 0.25$ where only for the smallest values of
$k$ the median of the Hill estimator touches the correct value.

Figure 4.2 $\log U(x)$ as a function of $\log x$ for the Burr(1,1,1) and Burr(1,0.25,4)
distributions.

Figure 4.3 Median and quartiles of the Hill estimates as a function of $k$,
$k = 1, \ldots, 400$, obtained from 100 samples of size 500 from a $|T_4|$ distribution.

A large bias leads to poor coverage probabilities of confidence intervals. In
many practical cases, systematic over- or underestimation has to be avoided.
For instance, in the valuation of a precious stone deposit, systematic
overestimation of the tail of the carat size distribution will lead to over-optimistic
predictions.


(iii) The Hill estimator shares a serious defect with many other common
estimators that are based on log-transformed data: the estimator is not invariant
with respect to shifts of the data. As mentioned by several authors,
inadequate use of the Hill estimator in conjunction with a data shift can lead to
systematic errors as well. A location-invariant modification of the Hill
estimator is proposed in Fraga Alves (2001). To this end, a secondary $k$-value,
denoted by $k_0$ ($k_0 < k$), is introduced, leading to

$$\hat{\gamma}^{(H)}(k_0, k) = \frac{1}{k_0} \sum_{j=1}^{k_0} \log \frac{X_{n-j+1,n} - X_{n-k,n}}{X_{n-k_0,n} - X_{n-k,n}}.$$

If one lets both $k = k_n$ and $k_0 = k_{0,n}$ tend to infinity with $n \to \infty$, such that
$k/n \to 0$ and $k_0/k \to 0$, one can show that $\hat{\gamma}^{(H)}(k_0, k)$ is consistent. An
adaptive version of the proposed estimator has been proposed from the best
theoretical $k_0$ given by

知 〜 [(1 + y)/^] 2/iWy) k 2y/{Wy) 

starting from some initial estimate $\hat{\gamma}^{(0)}$ in a first step; for instance, obtained
by setting $k_0^{(0)} = [2k^{2/3}]$.


4.3 Other Regression Estimators 


The Hill estimator has been obtained from the Pareto quantile plot using a quite 
naive estimator of the slope in the ultimate right end of the quantile plot. Of course, 
more flexible regression methods on the highest k points of the Pareto quantile 
plot could be applied. This programme was carried out in detail in Schultze and 
Steinebach (1996), Kratz and Resnick (1996) and Csorgo and Viharos (1998). We 
refer to these papers for more mathematical details and confine ourselves here to 
the derivation of the estimators. 


(i) The weighted least-squares fit on the Pareto quantile plot as treated in
section 4.2 can be rewritten in the form

$$\hat{\gamma}_k = \frac{\sum_{i=1}^{k} \left( \sum_{j=1}^{i} w_{j,k} \log\frac{k+1}{j} \right) \left( \log X_{n-i+1,n} - \log X_{n-i,n} \right)}{\sum_{j=1}^{k} w_{j,k} \log^2\frac{k+1}{j}}.$$

Setting $K(i/k) = i^{-1} \sum_{j=1}^{i} w_{j,k} \log\frac{k+1}{j}$, this estimator can be
approximated by

$$\hat{\gamma}_{K,k} = \frac{k^{-1} \sum_{i=1}^{k} K(i/k)\, i \left( \log X_{n-i+1,n} - \log X_{n-i,n} \right)}{k^{-1} \sum_{i=1}^{k} K(i/k)},$$

showing that weighted least-squares estimation leads to the class of kernel
estimators introduced by Csorgo et al. (1985). Here, $K$ denotes a kernel
function associating different weights to the different order statistics.
However, Csorgo et al. (1985) also consider kernel functions with support outside
$(0,1]$. Optimal choice of $K$ is possible but is hard to manage in practice.
Weighting the spacings $Z_i$ has the advantage that the graphs of the estimates
as a function of $k$ are smoother in comparison with, for instance, the Hill
estimator where adjacent values of $k$ can lead to quite different estimates.


(ii) The problem of non-smoothness of Hill estimates as a function of $k$ can be
solved in another way: simple unconstrained least squares with estimation of
a slope $\gamma$ as well as an intercept, say $\delta$, can already provide more
smoothness even without the use of a kernel function. This procedure of fitting
lines to (parts of) QQ-plots and especially double-logarithmic plots can be
traced back to Zipf from the late 1940s (see Zipf (1949)). Only recently, this
procedure has been studied in more depth.

The classical least-squares procedure minimizing

$$\sum_{j=1}^{k} \left( \log X_{n-j+1,n} - \left( \delta + \gamma \log\frac{k+1}{j} \right) \right)^2$$

with respect to $\delta$ and $\gamma$ leads to

$$\hat{\gamma} = \frac{\frac{1}{k} \sum_{j=1}^{k} \left( \log\frac{k+1}{j} - \frac{1}{k} \sum_{i=1}^{k} \log\frac{k+1}{i} \right) \log X_{n-j+1,n}}{\frac{1}{k} \sum_{j=1}^{k} \log^2\frac{k+1}{j} - \left( \frac{1}{k} \sum_{j=1}^{k} \log\frac{k+1}{j} \right)^2}.$$

This is the estimator proposed in Schultze and Steinebach (1996) and Kratz
and Resnick (1996). In Csorgo and Viharos (1998), the asymptotic properties
of this estimator are reviewed. These authors also propose a generalization
of this estimator that again can be motivated by a weighted least-squares
algorithm:

$$\frac{\sum_{j=1}^{k} \left( \int_{(j-1)/k}^{j/k} J(s)\,ds \right) \log X_{n-j+1,n}}{\sum_{j=1}^{k} \left( \int_{(j-1)/k}^{j/k} J(s)\,ds \right) \log\frac{k+1}{j}}$$

where $J$ is a non-increasing function defined on $(0,1)$, which integrates to 0.
Csorgo and Viharos (1998) propose to use the weight functions $J$ of the type

$$J_\theta(s) := \frac{\theta + 1}{\theta} - \frac{(\theta + 1)^2}{\theta}\,s^{\theta}, \quad s \in [0, 1]$$

for some $\theta > 0$.
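The unconstrained least-squares fit on the upper part of the Pareto quantile plot can be sketched as follows. This is a minimal Python illustration, assuming the abscissae $\log((k+1)/j)$; replacing them by $\log((n+1)/j)$ only shifts the intercept and leaves the slope unchanged. The function names and the synthetic test data are hypothetical.

```python
import math


def pareto_qq_points(sample, k):
    """Points (log((k+1)/j), log X_{n-j+1,n}), j = 1, ..., k."""
    xs = sorted(sample)
    n = len(xs)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    return [(math.log((k + 1) / j), math.log(xs[n - j]))
            for j in range(1, k + 1)]


def least_squares_slope(sample, k):
    """Slope of the ordinary least-squares line through the k upper points."""
    points = pareto_qq_points(sample, k)
    mean_x = sum(x for x, _ in points) / k
    # Centred cross-product and variance of the abscissae.
    sxy = sum((x - mean_x) * y for x, y in points) / k
    sxx = sum(x * x for x, _ in points) / k - mean_x ** 2
    return sxy / sxx


if __name__ == "__main__":
    k, slope = 50, 0.4
    # Synthetic data whose upper k points lie exactly on a line of slope 0.4.
    upper = [2.0 * ((k + 1) / j) ** slope for j in range(1, k + 1)]
    lower = [0.1 * i for i in range(1, 11)]
    print(least_squares_slope(lower + upper, k))
```

On data whose upper points are exactly linear in the plot, the fitted slope reproduces the true slope, which makes the function easy to check before applying it to real samples.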
4.4 A Representation for Log-spacings
and Asymptotic Results

In this section, we investigate the most important mathematical properties of the 
Hill estimator and some selected generalizations as discussed above. In particu¬ 
lar, we propose expressions for the asymptotic bias and the asymptotic variance. 
These results will be helpful later when we discuss the adaptive choice of k. The 
given results can also assist in providing remedies for some of the problems cited 
above. 

In section 4.2.1 (iii), we derived that Hill's estimator can be written as a simple
average of scaled log-spacings:

$$H_{k,n} = \frac{1}{k} \sum_{j=1}^{k} Z_j \quad \text{with } Z_j = j\left( \log X_{n-j+1,n} - \log X_{n-j,n} \right).$$

We will now elaborate on these spacings. We continue the discussion as started in
4.2.1 (iii), where we found that in the case of strict Pareto distributions

$$Z_j \stackrel{D}{=} \gamma E_j, \quad j = 1, \ldots, k,$$

with $\{E_i;\ 1 \le i \le n\}$ a sample from an exponential distribution with mean 1. In
accordance with our conventions, their order statistics are then denoted by

$$E_{1,n} \le E_{2,n} \le \cdots \le E_{n-k+1,n} \le \cdots \le E_{n,n}.$$

Double use of the probability integral transform leads to the linking equalities

$$X_{j,n} \stackrel{D}{=} U(e^{E_{j,n}}), \quad 1 \le j \le n. \qquad (4.2)$$

The main reason for using an exponential sample lies in a remarkable property
about the order statistics of the latter distribution, discovered by A. Renyi. Indeed,

$$E_{n-j+1,n} - E_{n-k,n} \stackrel{D}{=} \sum_{i=j}^{k} \frac{E_i}{i}, \quad 1 \le j \le k < n \qquad (4.3)$$

where $\{E_i,\ 1 \le i \le n - 1\}$ is again an exponential sample with mean 1. From this
equation, one can, for example, derive the expectations of the exponential order
statistics in that

$$E(E_{n-k+1,n}) = \sum_{j=k}^{n} \frac{1}{j} \approx \log\frac{n}{k}$$

if $n$ is large.
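The Renyi representation lends itself to a small numerical check. The Python sketch below assumes the standard form of the representation, $E_{n-k+1,n} \stackrel{D}{=} \sum_{i=k}^{n} E_i/i$, and compares the resulting expectation $\sum_{j=k}^{n} 1/j$ with $\log(n/k)$ and with Monte Carlo averages. The function names, sample sizes and seed are illustrative assumptions.

```python
import math
import random


def expected_exponential_order_statistic(n, k):
    """E(E_{n-k+1,n}) = sum_{j=k}^n 1/j for a standard exponential sample."""
    return sum(1.0 / j for j in range(k, n + 1))


def renyi_draw(n, k, rng):
    """One draw of E_{n-k+1,n} via the representation sum_{i=k}^n E_i / i."""
    return sum(rng.expovariate(1.0) / i for i in range(k, n + 1))


def direct_draw(n, k, rng):
    """One draw of E_{n-k+1,n} by sorting a standard exponential sample."""
    return sorted(rng.expovariate(1.0) for _ in range(n))[n - k]


if __name__ == "__main__":
    rng = random.Random(1)
    n, k, reps = 300, 10, 2000
    print("harmonic sum:", expected_exponential_order_statistic(n, k))
    print("log(n/k):    ", math.log(n / k))
    print("Renyi mean:  ", sum(renyi_draw(n, k, rng) for _ in range(reps)) / reps)
    print("direct mean: ", sum(direct_draw(n, k, rng) for _ in range(reps)) / reps)
```

The two simulated means should be close to the harmonic sum, and the gap to $\log(n/k)$ is roughly of order $1/(2k)$, so the logarithmic approximation improves as $k$ grows.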
We now combine the above with the second-order properties of the tail quantile
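function $U$. From here, we assume that $\log \ell_U$ satisfies $(C_{-\beta}(b))$ for some $\beta > 0$
and $b \in \mathcal{R}_{-\beta}$. This means that we can write

$$\frac{U(ux)}{U(x)} = u^{\gamma}\left( 1 + h_{-\beta}(u)\,b(x) + o(b(x)) \right). \qquad (4.4)$$

Using the second-order condition (4.4), we expand the distribution of the scaled
spacings $Z_j = j(\log X_{n-j+1,n} - \log X_{n-j,n})$, $j = 1, \ldots, k$:

$$\begin{aligned}
Z_j &= j \log\frac{X_{n-j+1,n}}{X_{n-j,n}} \\
&\stackrel{D}{=} j \log\frac{U(e^{E_{n-j+1,n}})}{U(e^{E_{n-j,n}})} \\
&\stackrel{D}{=} j \log\frac{U(e^{E_j/j}\,e^{E_{n-j,n}})}{U(e^{E_{n-j,n}})} \\
&= j\left\{ \gamma \log e^{E_j/j} + \log\left[ 1 + h_{-\beta}(e^{E_j/j})\,b(e^{E_{n-j,n}})(1 + o(1)) \right] \right\} \\
&\sim \gamma E_j + j \log(1 + W_{n,j})
\end{aligned}$$

where we used (4.3) and the abbreviation

$$W_{n,j} := h_{-\beta}(e^{E_j/j})\,b(e^{E_{n-j,n}}). \qquad (4.5)$$

We therefore get a stochastic representation for the spacings

$$Z_j \sim \gamma E_j + j \log(1 + W_{n,j}). \qquad (4.6)$$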

One way of using this result is to replace the log-term on the right by inequalities
like

$$\frac{y}{1 + y} \le \log(1 + y) \le y$$

that yield universal, stochastic inequalities for the Hill estimator since
$H_{k,n} = \frac{1}{k}\sum_{j=1}^{k} Z_j$. Another possibility is to look at approximations for $\log(1 + y)$ for $y$
small. First of all, it is easy to see that for $y$ small, $h_{-\beta}(e^{y}) = y(1 + o(1))$. Next,
we need to gain a bit more insight into the behaviour of the argument of $b(x)$
in (4.5). As long as $j/n \to 0$ when $n \to \infty$, we have $E_{n-j,n}/\log(n/j) \stackrel{P}{\to} 1$ (see
section 3.2, case 2, (i)). But then $b(e^{E_{n-j,n}}) = b(n/j)(1 + o_P(1))$. This means
that we can approximate $j \log(1 + W_{n,j})$ in distribution by $E_j\,b(n/j)$. Hence, we
are led to the following approximate representation:

$$Z_j \sim \left( \gamma + b\!\left(\frac{n+1}{j+1}\right) \right) E_j, \qquad (4.7)$$

or, using the regular variation of $b$ with index $-\beta$,

$$Z_j \sim \left( \gamma + b_{n,k}\left(\frac{j}{k+1}\right)^{\beta} \right) E_j, \quad 1 \le j \le k. \qquad (4.8)$$

In the sequel, we use the notation $b_{n,k} = b\!\left(\frac{n+1}{k+1}\right)$.

The above is just a sketch of the proof of (4.8). In Beirlant et al. (2002c), the
following result is proven. Similar results can be found in Kaufmann and Reiss
(1998) and Drees et al. (2000).

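Theorem 4.1 Suppose (4.4) holds. Then there exist random variables $R_{j,n}$ and
standard exponential random variables $E_j$ (independent for each $n$) such that

$$\sup_{1 \le j \le k} \left| Z_j - \left( \gamma + b_{n,k}\left(\frac{j}{k+1}\right)^{\beta} \right) E_j - R_{j,n} \right| = o_P(b_{n,k}), \qquad (4.9)$$

as $k, n \to \infty$ with $k/n \to 0$, where uniformly in $i = 1, \ldots, k$

$$\sum_{j=i}^{k} \frac{R_{j,n}}{j} = o_P\!\left( b_{n,k} \max\left( \log\frac{k+1}{i},\ 1 \right) \right).$$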

Let us draw some first conclusions concerning the Hill estimator. 

(i) The asymptotic bias of the Hill estimator can be traced back using the
exponential representation. Indeed,

$$\mathrm{ABias}(H_{k,n}) \sim b_{n,k}\,\frac{1}{k} \sum_{j=1}^{k} \left(\frac{j}{k+1}\right)^{\beta} \sim \frac{b_{n,k}}{1 + \beta}.$$

We notice that the bias will be small only if $b_{n,k}$ is small, which in turn
requires $k$ to be small.

(ii) The asymptotic variance of the Hill estimator is even easier in that

$$\mathrm{AVar}(H_{k,n}) \sim \mathrm{var}\left( \frac{\gamma}{k} \sum_{j=1}^{k} E_j \right) = \frac{\gamma^2}{k}.$$

Notice that the variance will be small if we take $k$ large.

(iii) Finally, the asymptotic normality of the Hill estimator can be expected when
$k, n \to \infty$ and $k/n \to 0$. For then, if $\sqrt{k}\,b_{n,k} \to 0$,

$$\sqrt{k}\left( \frac{H_{k,n}}{\gamma} - 1 \right) \stackrel{D}{\to} N(0, 1).$$

This result allows the construction of approximate confidence intervals for
$\gamma$. At the level $(1 - \alpha)$, this interval is given by

$$\left( \frac{H_{k,n}}{1 + \Phi^{-1}(1 - \alpha/2)/\sqrt{k}},\ \frac{H_{k,n}}{1 - \Phi^{-1}(1 - \alpha/2)/\sqrt{k}} \right) \qquad (4.10)$$

which is an acceptable approach if the bias is not too important, that is, if
$\beta \ge 1$. Typically, the condition $\sqrt{k}\,b_{n,k} \to 0$ severely restricts the range of
$k$-values where the confidence interval works.
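An approximate confidence interval of this kind can be computed as follows. The Python sketch assumes that the interval is obtained by inverting the normal approximation $\sqrt{k}(H_{k,n}/\gamma - 1) \approx N(0,1)$, which gives the bounds $H_{k,n}/(1 \pm \Phi^{-1}(1-\alpha/2)/\sqrt{k})$, and that the bias term is negligible. The function name is hypothetical; `statistics.NormalDist` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def hill_confidence_interval(hill_value, k, alpha=0.05):
    """Approximate (1 - alpha) interval for gamma from a Hill estimate.

    Inverts sqrt(k) * (H / gamma - 1) ~ N(0, 1); ignores the bias term,
    so it is only sensible when sqrt(k) * b_{n,k} is close to 0.
    """
    if hill_value <= 0:
        raise ValueError("the Hill estimate must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z / math.sqrt(k)
    if half_width >= 1.0:
        raise ValueError("k is too small for this approximation")
    return hill_value / (1.0 + half_width), hill_value / (1.0 - half_width)


if __name__ == "__main__":
    print(hill_confidence_interval(0.5, 100))
```

Note that the interval is not symmetric around the Hill estimate: the upper bound lies further away than the lower bound, a consequence of inverting the approximation for $H_{k,n}/\gamma$ rather than for $H_{k,n} - \gamma$.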


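We end this section outlining how the above exponential representation result
can be used to derive formally the above bias-variance expressions and to deduce
asymptotic normality results for kernel type statistics

$$\frac{1}{k} \sum_{j=1}^{k} K\!\left(\frac{j}{k+1}\right) Z_j$$

as discussed earlier in this chapter. Here, we assume that the kernel $K$ can be
written as $K(t) = \frac{1}{t}\int_0^t u(v)\,dv$, $0 < t < 1$ for some function $u$ defined on $(0, 1)$.
Such type of results can be found, for instance, in Csorgo et al. (1985) and Csorgo
and Viharos (1998).

Theorem 4.2 Suppose (4.4) holds. Let $K(t) = \frac{1}{t}\int_0^t u(v)\,dv$ for some function $u$
satisfying $\left| k \int_{(j-1)/k}^{j/k} u(t)\,dt \right| \le f\!\left(\frac{j}{k+1}\right)$ for some positive continuous function $f$
defined on $(0, 1)$ such that $\int_0^1 \log^+(1/w)\,f(w)\,dw < \infty$ and $\int_0^1 |K|^{2+\delta}(w)\,dw < \infty$
for some $\delta > 0$. Suppose $\sqrt{k}\,b_{n,k} = O(1)$ as $k, n \to \infty$ with $k/n \to 0$. Then with
the same notations as before, we have that

$$\sqrt{k}\left[ \frac{1}{k} \sum_{j=1}^{k} K\!\left(\frac{j}{k+1}\right) \left\{ Z_j - \left( \gamma + b_{n,k}\left(\frac{j}{k+1}\right)^{\beta} \right) \right\} \right]$$

converges in distribution to a $N\!\left(0,\ \gamma^2 \int_0^1 K^2(u)\,du\right)$ distribution.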

The result follows from the Lindeberg-Feller central limit theorem after showing
that

$$\frac{1}{\sqrt{k}} \sum_{j=1}^{k} K\!\left(\frac{j}{k+1}\right) R_{j,n} \stackrel{P}{\to} 0,$$

which follows from $\sqrt{k}\,b_{n,k} = O(1)$ and the above exponential representation
theorem since




^2 \ 7 / u ( t )dt 1 Rj f n = ^ ^ 


i/k 




u(t)dt 


R 


J,n 


and hence 


k 冲 亡坠， 




u{t)dt 


Rj, 




E 


R 


J，n 


From Theorem 4.2, we learn that the asymptotic mean squared error of kernel
estimators $\hat{\gamma}_{K,k}$ equals $k^{-1}\gamma^2 \int_0^1 K^2(u)\,du + b_{n,k}^2\left( \int_0^1 K(u)\,u^{\beta}\,du \right)^2$. In Csorgo et al.
(1985), kernels are derived (with support possibly outside $(0, 1]$), which minimize
$\mathrm{AMSE}(\hat{\gamma}_{K,k})$. In case $\log \ell_U$ satisfies $(C_{-\beta}(b))$ for some $\beta > 0$, Csorgo et al. (1985)
found the kernel

[ 卢 ⑴ = 


l + P 「1 + 2 ( 6 、 1+/! 
~l~\2 + 2ti) 




0 < / < 


2 + 2 卢 
1+2 卢 


to be optimal in the sense that among all kernels satisfying / 0 °° K(u)du = 
f 0 K 2 (u)du = 1 the asymptotic mean squared error is minimal for Kp. Note, 
however, that the optimal kernel depends on which makes the application of 
this approach somewhat difficult. In Csorgo and Viharos (1998), a method for 
implementing this approach is proposed. The estimation of will be discussed 
below. 


4.5 Reducing the Bias 

In many cases, the Hill estimator overestimates the population value of $\gamma$ due to
slow convergence of $b_{n,k}$ to 0. Some proposals of bias-reduced estimators were
recently introduced, for instance, in Peng (1998), Feuerverger and Hall (1999),
Beirlant et al. (1999), Gomes et al. (2000) and Gomes and Martins (2002). The
last references make use of the exponential representation developed above. Again,
one can tackle the problem from the quantile or the probability view.

4.5.1 The quantile view 

The representation (4.8) can be considered as a generalized regression model with
exponentially distributed responses. For every fixed $j$, the responses $Z_j$ are
approximately exponentially distributed with mean $\gamma + b_{n,k}\,(j/(k+1))^{\beta}$. If $b_{n,k} > 0$, then the
means increase with increasing values of $j$ while the intercept is given by $\gamma$. This
is illustrated in Figure 4.4 using a simulation from a Burr(1,1,2) distribution of
size $n = 500$. We show $k = 200$ points.

Figure 4.4 Plot of $Z_j$ versus $j$, $j = 1, \ldots, 200$, for a simulated sample of size
$n = 500$ from the Burr(1,1,2) distribution; solid horizontal line: true value of $\gamma$;
broken horizontal line: $H_{200,500}$; solid curve: $\gamma + b_{n,k}\,(j/(k+1))^{\beta}$; broken curve:
$\hat\gamma + \hat b_{n,k}\,(j/(k+1))^{\hat\beta}$.

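Some simple variations of (4.8) were proposed:

$$Z_j \sim \gamma \exp\!\left(d_{n,k}\left(\frac{j}{k+1}\right)^{\beta}\right) E_j, \qquad 1 \le j \le k, \tag{4.11}$$

with $d_{n,k} = b_{n,k}/\gamma$, using the approximation

$$1 + d_{n,k}\left(\frac{j}{k+1}\right)^{\beta} \sim \exp\!\left(d_{n,k}\left(\frac{j}{k+1}\right)^{\beta}\right).$$

Alternatively, changing the generalized linear model (4.8) into a regression
model with additive noise (replacing the random factors $E_j$ by their expected
values in the bias term), we obtain

$$Z_j \sim \gamma + b_{n,k}\left(\frac{j}{k+1}\right)^{\beta} + \gamma\,(E_j - 1), \qquad 1 \le j \le k. \tag{4.12}$$

Joint estimates of $\gamma$, $b_{n,k}$ (or $d_{n,k}$) and $\beta$ can be obtained for each $k$ from (4.8)
and (4.11) by maximum likelihood, or from (4.12) by least-squares minimizing

$$\sum_{j=1}^{k}\left(Z_j - \gamma - b_{n,k}\left(\frac{j}{k+1}\right)^{\beta}\right)^{2}$$

with respect to $\gamma$, $b_{n,k}$ and $\beta$. We denote the maximum likelihood estimator of $\gamma$
based on (4.8) by $\hat\gamma^{+}_{ML}$.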

In view of the discussion of the properties of the Hill estimator in the
preceding section, we will mainly focus on the case $\beta < 1$. Remark, however, that
the regression models are not identifiable when $\beta$ equals 0, for then $\gamma$ and $b_{n,k}$
together make up the mean response. Necessarily this fact leads to instabilities
in case $\beta$ is close to 0. An algorithm was constructed in Beirlant et al. (1999)
to search for the maximum likelihood estimate of $\gamma$ next to estimates $\hat b_{n,k}$ and
$\hat\beta_k$ under (4.8). In order to avoid instabilities in the maximization routines, the
restriction $\hat\beta_k > 0.5$ was introduced for sample sizes up to $n = 1000$. Simulation
experiments indicated that this bound can gradually be relaxed with larger sample
sizes, for instance, to $\hat\beta_k > 0.25$ for $n = 5000$. Also, to avoid instabilities arising
from the optimization procedure ending at a local maximum, some smoothness
conditions were added, linking estimates at subsequent values of $k$: given $\hat b_{n,k+1}$
and $\hat\beta_{k+1}$, we set $|\hat b_{n,k}| < 1.1\,|\hat b_{n,k+1}|$ and $\hat\beta_k < 1.1\,\hat\beta_{k+1}$. A similar program can be
carried out on the basis of (4.12) using least squares. The results obtained in this
way are very similar to those obtained by maximum likelihood.

The variance of $\hat\gamma^{+}_{ML}$ when $k \to \infty$ and $k/n \to 0$ equals $((1+\beta)/\beta)^{4}\,\gamma^{2}/k$
in first order (see Beirlant et al. (2002c)), showing that the variance of these
estimators is much larger than in case of the Hill estimator. The asymptotic bias,
however, is zero as long as $\sqrt{k}\,b_{n,k} = O(1)$. This is in contrast to the Hill
estimator, where the asymptotic bias only disappears for relatively small values of $k$,
that is, when $\sqrt{k}\,b_{n,k} \to 0$. The graphs with the resulting estimates as a function of $k$
are much more stable, taking away a serious part of the bias of the Hill estimator.
Experience suggests that the largest values of $k$ for which both the Hill and the
bias-reduced estimator correspond closely provide reasonable estimates for $\gamma$. The
use of the regression models to choose $k$ adaptively will be further explored in
section 4.7. The mean squared errors find their minima at much higher values of
$k$ in comparison with the Hill estimator; the respective minima are typically of the
same size.

Confidence intervals for $\gamma$ can now be constructed on the basis of a
bias-reduced maximum likelihood estimator with the above-mentioned variance. In
contrast to the confidence interval (4.10), this leads to intervals that better
approximate the required confidence level $1 - \alpha$. This is illustrated in Figure 4.5 using
simulated samples of size $n = 500$ from a Burr(1,0.5,2) distribution, for which
$\beta = 0.5$.

Figure 4.5 (a) Medians of the estimated standard deviations of the Hill estimator
(broken line) and the maximum likelihood estimator (solid line) as a function
of $k$, $k = 5, \ldots, 250$ and (b) corresponding coverage probabilities of confidence
intervals for $k = 5, \ldots, 250$ based on the Hill estimator (broken line) and the
maximum likelihood estimator (solid line).
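In each of the three regression models considered above, one can also solve for
$\gamma$ and $b_{n,k}$, or $\gamma$ and $d_{n,k}$, after substituting a consistent estimator $\hat\beta = \hat\beta_{k,n}$ for
$\beta$. For brevity, we focus on the least-squares estimators based on (4.12), leading to

$$\hat\gamma^{+}_{LS}(\hat\beta) = \bar Z_k - \frac{\hat b_{LS}(\hat\beta)}{1+\hat\beta},$$

$$\hat b_{LS}(\hat\beta) = \frac{(1+\hat\beta)^{2}(1+2\hat\beta)}{\hat\beta^{2}}\,\frac{1}{k}\sum_{j=1}^{k}\left(\left(\frac{j}{k+1}\right)^{\hat\beta} - \frac{1}{1+\hat\beta}\right) Z_j.$$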
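The closed-form least-squares step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `scaled_log_spacings` and `ls_bias_reduced` are ours, $Z_j$ is taken to be the scaled log-spacing $j(\log X_{n-j+1,n} - \log X_{n-j,n})$, and the value of $\beta$ is assumed to be supplied by the user.

```python
import math


def scaled_log_spacings(sample, k):
    """Return Z_j = j * (log X_{n-j+1,n} - log X_{n-j,n}) for j = 1, ..., k."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    return [j * (math.log(xs[n - j]) - math.log(xs[n - j - 1]))
            for j in range(1, k + 1)]


def ls_bias_reduced(z, beta):
    """Least-squares estimates (gamma, b) in Z_j ~ gamma + b (j/(k+1))^beta + noise.

    beta is treated as known; the centring 1/(1+beta) and the scale factor
    use the large-k approximations quoted in the text.
    """
    k = len(z)
    z_bar = sum(z) / k
    scale = (1 + beta) ** 2 * (1 + 2 * beta) / beta ** 2
    weighted = sum(((j / (k + 1)) ** beta - 1 / (1 + beta)) * zj
                   for j, zj in enumerate(z, start=1)) / k
    b_hat = scale * weighted
    gamma_hat = z_bar - b_hat / (1 + beta)
    return gamma_hat, b_hat
```

Applied to noise-free responses that follow the regression mean exactly, the function returns the intercept and slope up to an error of order $1/k$, which is a convenient sanity check before using it on data.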




Here, we approximated $k^{-1}\sum_{j=1}^{k}(j/(k+1))^{\beta}$ by $1/(1+\beta)$ and $k^{-1}\sum_{j=1}^{k}
((j/(k+1))^{\beta} - 1/(1+\beta))^{2}$ by $\beta^{2}(1+\beta)^{-2}(1+2\beta)^{-1}$. On the basis of
Theorem 4.2, one can show that the asymptotic variance of $\hat\gamma^{+}_{LS}$ equals $k^{-1}\gamma^{2}((1+\beta)/\beta)^{2}$.
Here, the increase of the variance in comparison with the Hill estimator is
not so large as with $\hat\gamma^{+}_{ML}$, but the question arises concerning an estimator of the
second-order parameter $\beta$. Drees and Kaufmann (1998) proposed the estimator

$$\hat\beta_{k,n,\lambda} = \frac{1}{\log\lambda}\,\log\frac{H_{\lfloor\lambda^{2}k\rfloor,n} - H_{\lfloor\lambda k\rfloor,n}}{H_{\lfloor\lambda k\rfloor,n} - H_{k,n}}$$

for some $\lambda \in (0, 1)$ and with $k$ taken in the range $\sqrt{k}\,b_{n,k} \to \infty$. An adaptive
choice for $k$ in this range is also given. It can also be shown that the estimators
of $\beta$ based on the regression models discussed here share this consistency
property as $\sqrt{k}\,b_{n,k} \to \infty$. For a more elaborate discussion of the estimation of $\beta$ and
several other estimators of $\beta$, we refer the reader to Gomes et al. (2002), Gomes
and Martins (2002) and Fraga Alves et al. (2003).
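The ratio-type estimator of the second-order parameter is easy to compute once the Hill estimates are available. The sketch below assumes the sign convention $\beta > 0$ and the form $\log(\text{ratio})/\log\lambda$; the function name and the convention that `hill[i]` holds $H_{i,n}$ (index 0 unused) are ours.

```python
import math


def beta_from_hill_ratios(hill, k, lam):
    """Estimate beta from Hill estimates at k, floor(lam*k) and floor(lam^2*k).

    hill[i] must hold H_{i,n}; index 0 is ignored. Returns None when the
    ratio of differences is not positive, in which case the logarithm is
    undefined and no estimate is produced.
    """
    if not 0 < lam < 1:
        raise ValueError("lam must lie in (0, 1)")
    k1 = math.floor(lam * k)
    k2 = math.floor(lam * lam * k)
    if k2 < 1:
        raise ValueError("k is too small for this choice of lam")
    numerator = hill[k2] - hill[k1]
    denominator = hill[k1] - hill[k]
    if denominator == 0 or numerator / denominator <= 0:
        return None
    return math.log(numerator / denominator) / math.log(lam)
```

If the Hill estimates behave like $\gamma + c\,k^{\beta}$, the ratio of differences equals $\lambda^{\beta}$, so the function returns $\beta$; on real data the differences are noisy, which is one reason the estimation of $\beta$ is considered difficult.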

The estimation of $\beta$ is known to be difficult. Hence, some authors have
proposed to set $\beta = 1$ in procedures involving knowledge of $\beta$. The resulting
estimators yield a compromise between the bias reduction of the estimators involving
estimation of $\beta$ and the smaller variance when using, for instance, the Hill
estimator; see, for instance, Gomes and Oliveira (2003). Guillou and Hall (2001) use the
estimator of $b_{n,k}$ obtained from (4.12) after setting $\beta = 1$ in the context of adaptive
selection rules for the number of extremes $k$. This will be discussed in section 4.7.

Finally, we mention that $\hat\gamma^{+}_{ML}$ and $\hat\gamma^{+}_{LS}$, while not shift-invariant in the
mathematical sense, are already much more stable under shifts than the Hill estimator.
The above-mentioned shift-invariant modification of the Hill estimator proposed
by Fraga Alves (2001) also yields stable plots.


4.5.2 The probability view 

Alternatively, Beirlant et al. (2004) propose to use a second-order refinement of
the probability view. Following the approach discussed in 4.2.1 (ii) where the Hill
estimator follows from the approximation of the conditional distribution of the
relative excesses $Y_j := X_{n-j+1,n}/X_{n-k,n}$, $j = 1, \ldots, k$, by a strict Pareto
distribution, one can argue that the Hill estimator will break down if this approximation
is poor. In order to describe the departure of $F_t(x) = P(X/t \le x \mid X > t)$ from a
strict Pareto distribution, we use the assumption that $\bar F$ satisfies (3.14):

$$\frac{1 - F(tx)}{1 - F(t)} = x^{-1/\gamma}\left(1 + h_{-\tau}(x)\,B(t) + o(B(t))\right), \tag{4.13}$$

where $\tau > 0$ and $B$ is regularly varying at infinity with index $-\tau$. Condition (4.13)
can be rephrased as

$$1 - F_t(x) = x^{-1/\gamma}\left[1 - B(t)\,\tau^{-1}(x^{-\tau} - 1) + o(B(t))\right],$$

as $t \to \infty$. Deleting the error term, this refines the original Pareto approximation
to an approximation by a mixture of two Pareto distributions. The idea is now to fit
such a perturbed Pareto distribution to the multiplicative excesses $Y_j$, $j = 1, \ldots, k$,
aiming for more accurate estimation of the unknown tail.

Such a perturbed Pareto distribution is then defined by the survival function

$$1 - G(x; \gamma, c, \tau) = (1 - c)\,x^{-1/\gamma} + c\,x^{-1/\gamma - \tau}$$

with some $c \in (-1/\tau, 1)$ and $x > 1$. Observe that if $c = 0$, then this mixture
coincides with the ordinary Pareto distribution.

For $c \to 0$, we can write

$$1 - G(x; \gamma, c, \tau) = \{x[1 + \gamma c(1 - x^{-\tau})]\}^{-1/\gamma} + o(c)$$

$$= \{x[(1 + \gamma c) - \gamma c\,x^{-\tau}]\}^{-1/\gamma} + o(c).$$

In practice, it turns out that

$$\bar G_{PPD}(x) = \{x[(1 + \gamma c) - \gamma c\,x^{-\tau}]\}^{-1/\gamma} \tag{4.14}$$

fits well by the maximum likelihood method, leading to estimators $\hat\gamma^{+}_{PPD}$, $\hat c^{+}_{PPD}$
and $\hat\tau^{+}_{PPD}$. The likelihood surface can be seen to be rather flat in $\tau$ so that the
optimization should be handled with care, comparable to the estimation of $\beta$ in the
generalized linear model (4.8).

The perturbed Pareto distribution (4.14) extends the generalized Pareto (GP)
distribution, which will be discussed in depth in Chapter 5, in the following way.
In statistics of extremes, it is common practice to approximate the distribution of
absolute exceedances of a random variable $Y$ above a high-enough threshold $u$ by
the generalized Pareto distribution:

$$P(Y - u > y \mid Y > u) = \left(1 + \frac{\gamma y}{\sigma}\right)^{-1/\gamma}, \qquad y > 0;\ \sigma > 0. \tag{4.15}$$

Replacing $y$ by $ux - u$ with $x > 1$ transforms (4.15) into a model for relative
excesses

$$P(Y/u > x \mid Y > u) = \left\{x\left[\frac{\gamma u}{\sigma} - \left(\frac{\gamma u}{\sigma} - 1\right)x^{-1}\right]\right\}^{-1/\gamma},$$

which is clearly (4.14) with $c = u/\sigma - 1/\gamma$ and $\tau = 1$.
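The link between the generalized Pareto model for relative excesses and the perturbed Pareto distribution can be checked numerically. The following sketch codes the three survival functions discussed above; the function names are ours and the parameter values in any call are purely illustrative.

```python
def ppd_survival(x, gamma, c, tau):
    """Survival function (4.14) of the perturbed Pareto distribution, x >= 1."""
    return (x * ((1 + gamma * c) - gamma * c * x ** (-tau))) ** (-1 / gamma)


def pareto_mixture_survival(x, gamma, c, tau):
    """Mixture of two Pareto survival functions, (1-c) x^(-1/g) + c x^(-1/g-tau)."""
    return (1 - c) * x ** (-1 / gamma) + c * x ** (-1 / gamma - tau)


def gp_relative_excess_survival(x, gamma, sigma, u):
    """P(Y/u > x | Y > u) under the generalized Pareto model (4.15), x >= 1."""
    return (1 + gamma * (u * x - u) / sigma) ** (-1 / gamma)
```

With $c = u/\sigma - 1/\gamma$ and $\tau = 1$ the first and third functions agree exactly, while for small $c$ the first and second differ only by a term of order $c^2$.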
4.6 Extreme Quantiles and Small Exceedance Probabilities

In the previous sections on the quantile viewpoint, we fitted a straight line to 
an ultimately linear part of the Pareto quantile plot. Continuing in this spirit and 
following the principle for estimating large quantiles and small exceedance prob¬ 
abilities, outlined in Figure 1.2, we are now in the position to estimate extreme 
quantiles under a Pareto-type model. However, the probability view allows for an 
alternative interpretation of the available methods. 


4.6.1 First-order estimation of quantiles and return periods 


We first discuss the simple approach proposed by Weissman (1978) based on the 
Hill estimator. 

We use the Pareto index estimation method based on linear regression of a
Pareto quantile plot to derive an estimator for $Q(1 - p)$. Assuming that the ultimate
linearity of the Pareto quantile plot persists from the largest $k$ observations on (till
infinity), that is, assuming that the strict Pareto model persists above this threshold,
we can extrapolate along the line with equation

$$y = \log X_{n-k,n} + H_{k,n}\left(x + \log\frac{k+1}{n+1}\right),$$

anchored at the point $\left(-\log\frac{k+1}{n+1},\ \log X_{n-k,n}\right)$.

Take $x = -\log p$ to obtain an estimator $\hat q^{+}_{k,p}$ of $Q(1 - p)$ given by

$$\hat q^{+}_{k,p} = \exp\left(\log X_{n-k,n} + H_{k,n}\log\frac{k+1}{(n+1)p}\right) = X_{n-k,n}\left(\frac{k+1}{(n+1)p}\right)^{H_{k,n}}.$$
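The extrapolation along the fitted line translates directly into code. The sketch below is a minimal illustration with function names of our own choosing; it assumes a sample of positive observations and $1 \le k \le n - 1$.

```python
import math


def hill_estimator(sample, k):
    """Hill estimator H_{k,n}: mean log-excess over the threshold X_{n-k,n}."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    threshold = xs[n - k - 1]
    return sum(math.log(xs[n - j] / threshold) for j in range(1, k + 1)) / k


def weissman_quantile(sample, k, p):
    """Weissman-type estimate of Q(1 - p) based on the k largest observations."""
    xs = sorted(sample)
    n = len(xs)
    h = hill_estimator(sample, k)
    return xs[n - k - 1] * ((k + 1) / ((n + 1) * p)) ** h
```

For instance, `weissman_quantile(claims, 200, 1e-4)` would extrapolate from the 200 largest values of a hypothetical data vector `claims` to the level exceeded with probability $10^{-4}$.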

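The asymptotic characteristics of this method can be found through the following
expansion: since $Q(1 - p) = p^{-\gamma}\,\ell_U(1/p)$ and $X_{n-k,n} \stackrel{d}{=} U_{k+1,n}^{-\gamma}\,\ell_U(U_{k+1,n}^{-1})$,
where $U_{j,n}$ denote the order statistics from a uniform (0,1) sample, we find that

$$\frac{\hat q^{+}_{k,p}}{Q(1-p)} \stackrel{d}{=} \left(\frac{U_{k+1,n}}{p}\right)^{-\gamma}\frac{\ell_U(U_{k+1,n}^{-1})}{\ell_U(p^{-1})}\left(\frac{k+1}{(n+1)p}\right)^{H_{k,n}}$$

$$= \left(\frac{U_{k+1,n}}{(k+1)/(n+1)}\right)^{-\gamma}\left(\frac{k+1}{(n+1)p}\right)^{H_{k,n}-\gamma}\frac{\ell_U(U_{k+1,n}^{-1})}{\ell_U(p^{-1})},$$

so that

$$\log\frac{\hat q^{+}_{k,p}}{Q(1-p)} \stackrel{d}{=} \gamma\left(-\log U_{k+1,n} - \log\frac{n+1}{k+1}\right) + (H_{k,n} - \gamma)\log\frac{k+1}{(n+1)p} + \log\frac{\ell_U(U_{k+1,n}^{-1})}{\ell_U(p^{-1})},$$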
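where we used the same notation as in section 4.4. Under condition (4.4), we can
approximate the last term by

$$-b_{n,k}\,\frac{1 - \left(\frac{(n+1)p}{k+1}\right)^{\beta}}{\beta}.$$

Using again $\sqrt{k}\left(-\log U_{k+1,n} - \log\frac{n+1}{k+1}\right) \to_d N(0, 1)$ as $k, n \to \infty$ and $k/n \to 0$ (see
section 3.2, case 2 (ii)), together with the exponential representation of the scaled
spacings $Z_j$, we find the expressions for the asymptotic variance and bias of
the Weissman estimator in the log-scale when $p = p_n \to 0$ and $np_n \to c > 0$ as
$n \to \infty$. We denote the asymptotic expectation by $E_\infty$.

$$E_\infty\left(\log\frac{\hat q^{+}_{k,p}}{Q(1-p)}\right) \sim \mathrm{ABias}(H_{k,n})\,\log\frac{k+1}{(n+1)p} - b_{n,k}\,\frac{1 - \left(\frac{(n+1)p}{k+1}\right)^{\beta}}{\beta}$$

$$= \frac{b_{n,k}}{1+\beta}\,\log\frac{k+1}{(n+1)p} - b_{n,k}\,\frac{1 - \left(\frac{(n+1)p}{k+1}\right)^{\beta}}{\beta}, \tag{4.16}$$

$$\mathrm{AVar}\left(\log \hat q^{+}_{k,p}\right) \sim \frac{\gamma^{2}}{k}\left(1 + \log^{2}\frac{k+1}{(n+1)p}\right). \tag{4.17}$$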


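Furthermore, it can now be shown that when $k, n \to \infty$ and $k/n \to 0$ such
that $\sqrt{k}\,E_\infty\left(\log\frac{\hat q^{+}_{k,p}}{Q(1-p)}\right) \to 0$,

$$\sqrt{k}\left(1 + \log^{2}\frac{k+1}{(n+1)p}\right)^{-1/2}\left(\frac{\hat q^{+}_{k,p}}{Q(1-p)} - 1\right) \to_d N(0, \gamma^{2}). \tag{4.18}$$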


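An asymptotic confidence interval of level $1 - \alpha$ for $Q(1 - p)$ when $p_n \to 0$ and
$np_n \to c > 0$ as $n \to \infty$ is given by

$$\left(\frac{\hat q^{+}_{k,p}}{1 + \Phi^{-1}(1-\alpha/2)\,\frac{H_{k,n}}{\sqrt{k}}\sqrt{1 + \log^{2}\frac{k+1}{(n+1)p}}}\ ,\ \frac{\hat q^{+}_{k,p}}{1 - \Phi^{-1}(1-\alpha/2)\,\frac{H_{k,n}}{\sqrt{k}}\sqrt{1 + \log^{2}\frac{k+1}{(n+1)p}}}\right).$$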

Alternatively, in order to estimate $P(X > x)$ for $x$ large, one can set the
Weissman estimator $\hat q^{+}_{k,p}$ equal to $x$ and solve for $p$:

$$\hat p^{+}_{k,x} = \frac{k+1}{n+1}\left(\frac{x}{X_{n-k,n}}\right)^{-1/H_{k,n}}.$$
This estimator can also be directly understood from the probability point of view.
Indeed, using the approximation $\bar F(ty)/\bar F(t) = P(X > ty)/P(X > t) \sim y^{-1/\gamma}$ as $t \to
\infty$, we find $\hat p^{+}_{k,x}$ when replacing $ty$ by $x$ and, when using $X_{n-k,n}$ as a threshold $t$,
estimating $P(X > t)$ by the empirical estimate $(k+1)/(n+1)$.
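Since the exceedance probability estimator is obtained by inverting the quantile estimator, the two should be exact inverses of each other for a fixed $k$. The sketch below (function names ours) makes this explicit.

```python
import math


def hill_estimator(sample, k):
    """Hill estimator H_{k,n} based on the k largest observations."""
    xs = sorted(sample)
    n = len(xs)
    threshold = xs[n - k - 1]
    return sum(math.log(xs[n - j] / threshold) for j in range(1, k + 1)) / k


def weissman_quantile(sample, k, p):
    """Estimate of Q(1 - p): X_{n-k,n} * ((k+1)/((n+1)p))^H."""
    xs = sorted(sample)
    n = len(xs)
    return xs[n - k - 1] * ((k + 1) / ((n + 1) * p)) ** hill_estimator(sample, k)


def exceedance_probability(sample, k, x):
    """Estimate of P(X > x): (k+1)/(n+1) * (x / X_{n-k,n})^(-1/H)."""
    xs = sorted(sample)
    n = len(xs)
    h = hill_estimator(sample, k)
    return (k + 1) / (n + 1) * (x / xs[n - k - 1]) ** (-1 / h)
```

Feeding the output of `weissman_quantile` for a given `p` into `exceedance_probability` returns `p` again, up to rounding.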
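Again, one can prove asymptotic normality in that

$$\sqrt{k}\left(1 + \gamma^{-2}\log^{2}\frac{x}{X_{n-k,n}}\right)^{-1/2}\left(\frac{\hat p^{+}_{k,x}}{P(X > x)} - 1\right) \to_d N(0, 1). \tag{4.19}$$

This leads to an asymptotic confidence interval of level $1 - \alpha$ for $P(X > x)$:

$$\left(\frac{\hat p^{+}_{k,x}}{1 + \Phi^{-1}(1-\alpha/2)\,k^{-1/2}\left(1 + H_{k,n}^{-2}\log^{2}\frac{x}{X_{n-k,n}}\right)^{1/2}}\ ,\ \frac{\hat p^{+}_{k,x}}{1 - \Phi^{-1}(1-\alpha/2)\,k^{-1/2}\left(1 + H_{k,n}^{-2}\log^{2}\frac{x}{X_{n-k,n}}\right)^{1/2}}\right).$$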


4.6.2 Second-order refinements 

The quantile view 

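Using the condition (4.4), one can refine $\hat q^{+}_{k,p}$ exploiting the additional information
that is then available concerning the slowly varying function $\ell_U$. Using again
$X_{n-k,n} \stackrel{d}{=} U(1/U_{k+1,n})$, we find that

$$\frac{Q(1-p)}{X_{n-k,n}} \stackrel{d}{=} \frac{p^{-\gamma}\,\ell_U(1/p)}{U_{k+1,n}^{-\gamma}\,\ell_U(1/U_{k+1,n})}$$

$$\sim \left(\frac{U_{k+1,n}}{p}\right)^{\gamma}\exp\left(b(1/U_{k+1,n})\,\frac{1 - (U_{k+1,n}/p)^{-\beta}}{\beta}\right)$$

$$\sim \left(\frac{k+1}{(n+1)p}\right)^{\gamma}\exp\left(b_{n,k}\,\frac{1 - \left(\frac{k+1}{(n+1)p}\right)^{-\beta}}{\beta}\right),$$

where in the last step, we replaced $U_{k+1,n}$ by its expected value $(k+1)/(n+1)$. Hence, we
arrive at the following estimator for extreme quantiles with $k = 3, \ldots, n - 1$:

$$\hat q^{(1)}_{k,p} = X_{n-k,n}\left(\frac{k+1}{(n+1)p}\right)^{\hat\gamma^{+}_{ML}}\exp\left(\hat b_{n,k}\,\frac{1 - \left(\frac{k+1}{(n+1)p}\right)^{-\hat\beta}}{\hat\beta}\right), \tag{4.20}$$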
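The second-order quantile estimator only requires the threshold and the three fitted parameters. The following sketch assumes these have already been estimated (for instance by maximum likelihood under (4.8)); the function name and the argument names are ours.

```python
import math


def bias_corrected_quantile(threshold, k, n, p, gamma_hat, b_hat, beta_hat):
    """Quantile estimate of the form (4.20).

    threshold is X_{n-k,n}; gamma_hat, b_hat and beta_hat are assumed to be
    estimates of gamma, b_{n,k} and beta obtained elsewhere.
    """
    ratio = (k + 1) / ((n + 1) * p)
    correction = math.exp(b_hat * (1 - ratio ** (-beta_hat)) / beta_hat)
    return threshold * ratio ** gamma_hat * correction
```

With `b_hat = 0` the correction factor equals one and the expression reduces to the first-order extrapolation with the Hill estimate replaced by `gamma_hat`.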
Figure 4.6 Median of three extreme quantile estimators (solid, broken-dotted and
broken lines) with $p = 0.0002$ for 100 simulated samples of size $n = 1000$ from the
Burr(1,0.5,2) distribution, $k = 5, \ldots, 200$. The horizontal line indicates the true
value of $Q(1 - p)$.


where $\hat\gamma^{+}_{ML}$, $\hat\beta$ and $\hat b_{n,k}$ denote the maximum likelihood estimators based on (4.8).
This estimator was studied in more detail in Matthys and Beirlant (2003). Among
others, it was proven that the asymptotic distributions of $\hat\gamma^{+}_{ML}$ and $\hat q^{(1)}_{k,p}$ are quite
similar. Indeed, compared to (4.18), the asymptotic variance now becomes $\gamma^{2}\left(\frac{1+\beta}{\beta}\right)^{4}$
instead of $\gamma^{2}$ in (4.18). Note that equation (4.20) can also be used to estimate small
exceedance probabilities. Indeed, fixing $\hat q^{(1)}_{k,p}$ at a high level, (4.20) can be solved
numerically for $p$. The resulting estimator for $p$ will be denoted by $\hat p^{(1)}_{k,x}$.

The bias-correcting effect obtained from using $\hat\gamma^{+}_{ML}$ and the factor
$\exp\left(\hat b_{n,k}\left[1 - ((k+1)/((n+1)p))^{-\hat\beta}\right]/\hat\beta\right)$ is illustrated in Figure 4.6 where we show
the medians computed over 100 samples of size $n = 1000$ from the Burr(1,0.5,2)
distribution and $p = 0.0002$. Next to $\hat q^{+}_{k,p}$ and $\hat q^{(1)}_{k,p}$, we also show the estimator

$$\hat q^{(0)}_{k,p} = X_{n-k,n}\left(\frac{k+1}{(n+1)p}\right)^{\hat\gamma^{+}_{ML}},$$

which is in fact $\hat q^{+}_{k,p}$ with $H_{k,n}$ replaced by $\hat\gamma^{+}_{ML}$.

The probability view 

Fitting a perturbed Pareto distribution to the relative excesses $Y_j$, $j = 1, \ldots, k$,
above a threshold $X_{n-k,n}$, as outlined in section 4.5.2, results in the following
tail estimator

$$\hat p^{PPD}_{k,x} = \frac{k+1}{n+1}\,\bar G_{PPD}\left(x/X_{n-k,n};\ \hat\gamma^{+}_{PPD}, \hat c^{+}_{PPD}, \hat\tau^{+}_{PPD}\right) \tag{4.21}$$

where $\bar G_{PPD}$ denotes the survival function of the perturbed Pareto distribution
(PPD) (4.14) introduced in section 4.5.2. Fixing $\hat p^{PPD}_{k,x}$ at a small value, (4.21) can
be solved numerically for $x$, yielding an extreme quantile estimator. This estimator
will be denoted by $\hat q^{PPD}_{k,p}$.


An example: the SOA Group Medical Insurance data. 


We illustrate the use of the above-introduced estimators for the extreme value
index and extreme quantiles with the SOA Group Medical Insurance data. In
Figure 4.7(a), we plot $\hat\gamma^{+}_{ML}$ (solid line), $H_{k,n}$ (broken line), a further estimator of $\gamma$
(broken-dotted line) and $\hat\gamma^{+}_{PPD}$ (dotted line) for the 1991 claim data against $k$. This plot indicates a
$\gamma$ estimate around 0.35. Insurance companies typically are interested in an estimate
of the claim amount that will be exceeded (on average) only once in, say, 100,000
cases. We illustrate the estimation of extreme quantiles in Figure 4.7(b). In this
figure, we plot three of the above extreme quantile estimators (solid, broken and
broken-dotted lines) for $U(100{,}000)$ as a function of $k$.


4.7 Adaptive Selection of the Tail Sample Fraction 

We now turn to the estimation of the optimal sample fraction needed to apply a 
tail index estimator like the Hill estimator. It should be intuitively clear that the 
estimates of $b_{n,k}$, the parameter that dominates the bias of the Hill estimator as
discussed in section 4.4, should be helpful to locate the values of k for which the 
bias of the Hill estimator is too large, or for which the mean squared error of 
the estimator is minimal. Several methods have been proposed recently, which we 
review briefly. See also Hall and Welsh (1985) and Beirlant et al. (1996b). 

(i) Guillou and Hall (2001) propose to choose $H_{\hat k,n}$ where $\hat k$ is the smallest value
of $k$ for which

$$\sqrt{\frac{k}{12}}\,\frac{|\hat b_{LS}(-1)|}{H_{k,n}} > c_{crit},$$

where $c_{crit}$ is a critical value such as 1.25 or 1.5.

To understand this standardization, first remark that on the basis of Theorem
4.2, one can show that if $\sqrt{k}\,b_{n,k} \to c \in \mathbb{R}$, then

So, after appropriate standardization of $\hat b_{LS}(-1)$, the procedure given in
Guillou and Hall (2001) can be considered as an asymptotic test for zero
(asymptotic) expectation of $\hat b_{LS}(-1)$: the bias in the Hill estimator is
considered to be too large, and hence the hypothesis of zero bias is rejected,
when the asymptotic mean in the limit result appears significantly different
from zero.

Figure 4.7 SOA Group Medical Insurance data: (a) $\hat\gamma^{+}_{ML}$ (solid line), $H_{k,n}$ (broken
line), a further estimator of $\gamma$ (broken-dotted line) and $\hat\gamma^{+}_{PPD}$ (dotted line) as a function of $k$ and (b)
three extreme quantile estimators (solid, broken and broken-dotted lines) for $U(100{,}000)$ as a
function of $k$.
(ii) An important alternative, popular among statisticians, is to minimize the
mean squared error. Then we try to minimize the asymptotic mean squared
error of $H_{k,n}$, that is,

$$\mathrm{AMSE}(H_{k,n}) = \mathrm{AVar}(H_{k,n}) + \mathrm{ABias}^{2}(H_{k,n}) = \frac{\gamma^{2}}{k} + \left(\frac{b_{n,k}}{1+\beta}\right)^{2} \tag{4.22}$$

as derived before. So it appears natural to use the maximum likelihood
estimators discussed above and search for the value of $k$ which minimizes
this estimated mean squared error plot $\{(k, \widehat{\mathrm{AMSE}}(H_{k,n}));\ k = 1, \ldots, n - 1\}$.
This simple method can of course also be applied to, for instance, the AMSE
of Weissman quantile estimators based on the expressions given in (4.16)
and (4.17). When estimating $U(100{,}000)$ in case of the SOA Group Medical
Insurance data, we arrive in this way at the value $k = 486$ that is to be
considered in Figure 4.7(b).

(iii) Let us restrict again to the Hall-class of distributions where the unknown
distribution satisfies

$$U(x) = C x^{\gamma}\left(1 + D x^{-\beta}(1 + o(1))\right) \qquad (x \to \infty)$$

for some constants $C > 0$, $D \in \mathbb{R}$. Observe that in this case, $b(x) = -\beta D x^{-\beta}
(1 + o(1))$ as $x \to \infty$. Then the asymptotic mean squared error of the Hill
estimator is minimal for

$$k_{n,opt} \sim \left(b^{2}(n)\right)^{-1/(1+2\beta)}\left(\frac{\gamma^{2}(1+\beta)^{2}}{2\beta}\right)^{1/(1+2\beta)} \qquad (n \to \infty).$$

Here, because of the particular form of $b$, we obtain

$$k_{n,opt} \sim \left(b^{2}_{n,k_0}\right)^{-1/(1+2\beta)}\,k_0^{2\beta/(1+2\beta)}\left(\frac{\gamma^{2}(1+\beta)^{2}}{2\beta}\right)^{1/(1+2\beta)} \tag{4.23}$$

for any secondary value $k_0 \in \{1, \ldots, n\}$ with $k_0 = o(n)$. We plug in
consistent estimators of $b_{n,k_0}$, $\beta$ and $\gamma$ in this expression as discussed above, all
based on the upper $k_0$ extremes. In this way, we obtain for each value of $k_0$
an estimator $\hat k_{n,k_0}$ of $k_{n,opt}$.

Then as $k_0, n \to \infty$, $k_0/n \to 0$ and $\sqrt{k_0}\,b_{n,k_0} \to \infty$, we have that

$$\frac{\hat k_{n,k_0}}{k_{n,opt}} \to_P 1.$$

Of course, a drawback of this approach is that in practice one needs to
identify the $k_0$-region for which $\sqrt{k_0}\,b_{n,k_0} \to \infty$ in order to obtain a consistent
method. However, graphs of $\log \hat k_{n,k_0}$ as a function of $k_0$ are quite stable,
except for the $k_0$-regions corresponding to $\sqrt{k_0}\,b_{n,k_0} \to 0$. This is illustrated
in Figure 4.8 for the SOA Group Medical Insurance data set. The plot of
$\log \hat k_{n,k_0}$ is stable from $k_0 = 3000$ up to $k_0 = 7000$, indicating a $\log \hat k$ value
around 5.3. This value corresponds to the endpoint of a stable horizontal area
in the Hill plot given in Figure 4.7(b) with height at approximately 0.36.

Figure 4.8 SOA Group Medical Insurance data: plot of $\log \hat k_{n,k_0}$ versus $k_0$, $k_0 =
3, \ldots, 10{,}000$.

In order to set up an automatic method, from a practical point of view one
can use the median of the first $\lfloor n/2 \rfloor$ $\hat k$-values as an overall estimate for $k_{n,opt}$:

$$\hat k_{n,med} = \mathrm{median}\left\{\hat k_{n,k_0} : k_0 = 3, \ldots, \left\lfloor \frac{n}{2} \right\rfloor\right\}. \tag{4.24}$$
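The plug-in rule and its median summary can be sketched as follows. The function `k_opt_plugin` evaluates the right-hand side of (4.23) for given estimates; how those estimates are produced for each $k_0$ is left open here, and both function names are ours.

```python
from statistics import median


def k_opt_plugin(gamma_hat, b_hat, beta_hat, k0):
    """Evaluate the plug-in expression (4.23) for one secondary value k0."""
    e = 1.0 / (1.0 + 2.0 * beta_hat)
    variance_part = (gamma_hat ** 2 * (1 + beta_hat) ** 2 / (2 * beta_hat)) ** e
    return (b_hat ** 2) ** (-e) * k0 ** (2 * beta_hat * e) * variance_part


def k_opt_median(estimates_by_k0):
    """Median of plug-in values; estimates_by_k0 maps k0 -> (gamma, b, beta)."""
    values = [k_opt_plugin(g, b, beta, k0)
              for k0, (g, b, beta) in estimates_by_k0.items()]
    return median(values)
```

A useful check is that, for an exact Hall-class bias function $b_{n,k_0} = -\beta D (k_0/n)^{\beta}$, the returned value does not depend on $k_0$ and minimizes the asymptotic mean squared error (4.22).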

(iv) In Hall (1990), a novel resampling technique to estimate the mean squared
error of the Hill estimator is proposed. For this purpose, the usual bootstrap
does not work properly, especially because it seriously underestimates bias.
This problem can be circumvented by taking resamples of smaller size than
the original one and linking the bootstrap estimates for the optimal subsample
fraction to $k_{n,opt}$ for the full sample. However, in order to establish this link,
Hall's method requires that $\beta = 1$, which puts a serious restriction on the
tail behaviour of the data. Moreover, an initial estimate is needed to estimate
the bias. As pointed out by Gomes and Oliveira (2001), the entire procedure
is highly sensitive to the choice of this initial value.

The idea of subsample bootstrapping is taken up in a broader method by
Danielsson et al. (1997). Instead of bootstrapping the mean squared error
of the Hill estimator itself, they use an auxiliary statistic, the mean squared
error of which converges at the same rate and which has a known asymptotic
mean, independent of the parameters $\gamma$ and $\beta$. Such a statistic is

$$A_{k,n} = H^{(2)}_{k,n} - 2H^{2}_{k,n}$$

with

$$H^{(2)}_{k,n} = \frac{1}{k}\sum_{j=1}^{k}\left(\log X_{n-j+1,n} - \log X_{n-k,n}\right)^{2}.$$

Since both $H^{(2)}_{k,n}/(2H_{k,n})$ and $H_{k,n}$ are consistent estimators for $\gamma$, $A_{k,n}$
will converge to 0 for intermediate sequences of $k$-values as $n \to \infty$. Thus
$\mathrm{AMSE}(A_{k,n}) = E_\infty(A^{2}_{k,n})$, and no initial parameter estimate is needed to
calculate the bootstrap counterpart.

Moreover, the $k$-value that minimizes $\mathrm{AMSE}(A_{k,n})$, denoted by $\bar k_{n,opt}$, is of
the same order in $n$ as $k_{n,opt}$.

Unfortunately, the usual bootstrap estimate for $\bar k_{n,opt}$ does not converge in
probability to the true value; it merely converges in distribution to a random
sequence owing to the characteristic balance between variance and squared
bias at the optimal threshold. A subsample bootstrap remedies this problem.
Taking subsamples of size $n_1 = O(n^{1-\varepsilon})$ for some $0 < \varepsilon < 1$ provides a
consistent bootstrap estimate $\hat{\bar k}_{n_1,opt}$ for $\bar k_{n_1,opt}$.

Further, the ratio of optimal sample and subsample fractions for $A_{k,n}$ is of
the order

$$\frac{\bar k_{n,opt}}{\bar k_{n_1,opt}} \sim \left(\frac{n}{n_1}\right)^{\frac{2\beta}{2\beta+1}}.$$

For $n_1 = O(n^{1-\varepsilon})$, $0 < \varepsilon < 0.5$, this ratio can be estimated through a second
subsample bootstrap, now with subsamples of size $n_2 = n_1^{2}/n$, such that

$$\frac{\bar k_{n,opt}}{\bar k_{n_1,opt}} \approx \frac{\bar k_{n_1,opt}}{\bar k_{n_2,opt}}.$$

Combining these results gives the estimator (4.25) for $k_{n,opt}$, where

$$\hat\beta := \frac{\log \hat{\bar k}_{n_1,opt}}{2\log\left(n_1/\hat{\bar k}_{n_1,opt}\right)}$$

is a consistent estimator for $\beta$.

Under condition $(C_{-\beta}(b))$ for $\log \ell_U$, it can be shown that the resulting Hill
estimator $H_{\hat k_{n,opt},n}$ has the same asymptotic efficiency as $H_{k_{n,opt},n}$.

The algorithm for this bootstrap procedure is summarized as follows:

(a) Draw $B$ bootstrap subsamples of size $n_1 \in (\sqrt{n}, n)$ from the original
sample and determine the value $\hat{\bar k}_{n_1,opt}$ that minimizes the bootstrap
mean squared error of $A_{k,n_1}$.

(b) Repeat this for $B$ bootstrap subsamples of size $n_2 = n_1^{2}/n$ and
determine $\hat{\bar k}_{n_2,opt}$ where the bootstrap mean squared error of $A_{k,n_2}$ is
minimal.

(c) Calculate $\hat k_{n,opt}$ from (4.25) and estimate $\gamma$ by $H_{\hat k_{n,opt},n}$.
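Steps (a) and (b) of the algorithm can be sketched as below. The code computes the auxiliary statistic $A_{k,m}$ and, for a given subsample size, the $k$ that minimizes the bootstrap mean of $A_{k,m}^{2}$. It is a plain quadratic-time illustration with names of our own choosing, and the conversion step (c) via (4.25) is deliberately left out.

```python
import math
import random


def a_statistic(xs_sorted, k):
    """A_{k,m} = H2 - 2 * H^2 for an ascending sorted sample of size m."""
    m = len(xs_sorted)
    log_threshold = math.log(xs_sorted[m - k - 1])
    excesses = [math.log(xs_sorted[m - j]) - log_threshold for j in range(1, k + 1)]
    h1 = sum(excesses) / k
    h2 = sum(v * v for v in excesses) / k
    return h2 - 2 * h1 * h1


def bootstrap_k(sample, size, n_boot, rng):
    """Return the k in 1..size-1 minimizing the bootstrap mean of A_{k,size}^2."""
    totals = [0.0] * size
    for _ in range(n_boot):
        resample = sorted(rng.choices(sample, k=size))
        for k in range(1, size):
            totals[k] += a_statistic(resample, k) ** 2
    return min(range(1, size), key=lambda k: totals[k])
```

A call such as `bootstrap_k(data, n1, 250, random.Random(1))` followed by the same call with `n2 = n1 * n1 // len(data)` yields the two ingredients needed in step (c); the seed and the number of resamples shown here are arbitrary choices.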

This procedure considerably extends and improves Hall's original bootstrap
method, especially because no preliminary parameter estimate is needed.
Only the subsample size $n_1$ and the number of bootstrap resamples $B$ have to
be chosen. In fact, the latter is determined mainly by the available computing
time. In simulation studies reported in the literature, the number of resamples
ranges from 250 to 5000. As for the subsample size, Danielsson and de
Vries (1997) suggest varying $n_1$ over a grid of values and using a bootstrap
diagnostic to select its optimal value adaptively. Gomes and Oliveira (2001),
however, found that the method is very robust with respect to the choice
of $n_1$. We also refer to Gomes and Oliveira (2001) for more variations
and simulation results on the above bootstrap to choose the optimal sample
fraction and for a refined version of Hall's method.

(v) Drees and Kaufmann (1998) present a sequential procedure to select the
optimal sample fraction $k_{n,opt}$. From a law of the iterated logarithm, they
construct 'stopping times' for the sequence $H_{k,n}$ of Hill estimators that are
asymptotically equivalent to a deterministic sequence. An ingenious
combination of two such stopping times then attains the same rate of convergence
as $k_{n,opt}$. However, the conversion factor to pass from this combination of
stopping times to $k_{n,opt}$ involves the unknown parameters $\gamma$ (which requires
an initial estimate $\hat\gamma_0$) and $\beta$.

We refer to the original paper by Drees and Kaufmann (1998) for the
theoretical principles behind this procedure and immediately describe the algorithm
with the choices of nuisance parameters as proposed by these authors.

(a) Obtain an initial estimate $\hat\gamma_0 := H_{2\sqrt{n},n}$ for $\gamma$.

(b) For $r_n = 2.5\,\hat\gamma_0\,n^{0.25}$ compute the 'stopping time'

$$\hat k_n(r_n) := \min\left\{k \in \{1, \ldots, n-1\}\ \Big|\ \max_{1 \le i \le k}\sqrt{i}\,|H_{i,n} - H_{k,n}| > r_n\right\}.$$

(c) Similarly, compute $\hat k_n(r_n^{\varepsilon})$ for $\varepsilon = 0.7$.

(d) With a consistent estimator $\hat\beta$ for $\beta$, calculate $\hat k_{n,opt}$ from (4.26)
and estimate $\gamma$ by $H_{\hat k_{n,opt},n}$.
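Steps (a) to (c) can be illustrated with the following sketch, which computes the whole path of Hill estimates and then the stopping time for a given level. The weighting by $\sqrt{i}$ follows the definition of the stopping time as we understand it from Drees and Kaufmann (1998); the function names are ours, and the conversion in step (d) is not implemented.

```python
import math


def hill_path(sample):
    """Return a list H with H[k] = H_{k,n} for k = 1, ..., n-1 (H[0] is None)."""
    logs = sorted((math.log(x) for x in sample), reverse=True)
    n = len(logs)
    path = [None] * n
    running_sum = 0.0
    for k in range(1, n):
        running_sum += logs[k - 1]
        path[k] = running_sum / k - logs[k]
    return path


def stopping_time(path, r):
    """Smallest k with max_{1<=i<=k} sqrt(i) |H_i - H_k| > r, or None if never."""
    for k in range(1, len(path)):
        deviation = max(math.sqrt(i) * abs(path[i] - path[k]) for i in range(1, k + 1))
        if deviation > r:
            return k
    return None
```

In an application one would take `gamma0 = path[int(2 * math.sqrt(n))]`, set `r = 2.5 * gamma0 * n ** 0.25`, and evaluate `stopping_time` at `r` and at `r ** 0.7`.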


In simulations, it was found that the method mostly performs better if a fixed
value $\beta_0$ is used for $\beta$ in (4.26), in particular, for $\beta_0 = 1$.


In Matthys and Beirlant (2000), Beirlant et al. (2002c) and Gomes and Oliveira
(2001), these adaptive procedures have been compared on the basis of extensive
small sample simulations. While both the bootstrap method and the plug-in method
tend to give rather variable values for the optimal sample fraction, the results for
all four adaptive Hill estimators are well in line. The sequential procedure and
the method based on $\hat k_{n,med}$ give the best results even when setting $\beta = 1$. The
influence of a wrong specification of the parameter $\beta$ in these methods does not
seem to be a major problem. In comparison with other procedures, method (iii)
performs best in case of small values of $\beta$ and even for distributions outside the
range of distributions considered by Hall and Welsh (1984), such as the log-gamma
distribution. The methods based on the regression models above, such as (ii) and
(iii), require the most computing effort. The sequential method appears to be the fastest
overall.
In Chapter 2, we derived the general conditions $(C_\gamma)$ and $(C^{*}_\gamma)$ for a non-degenerate
limit distribution of the normalized maximum of a sample of independent and
identically distributed random variables to exist:

$$\frac{U(xu) - U(x)}{a(x)} \to \frac{u^{\gamma} - 1}{\gamma} \quad \text{for any } u > 0 \text{ as } x \to \infty, \tag{$C_\gamma$}$$

for some regularly varying function $a$ with index $\gamma$, where $U(x) = Q(1 - \frac{1}{x})$,
respectively

$$\frac{\bar F(t + y\,b(t))}{\bar F(t)} \to (1 + \gamma y)^{-1/\gamma} \quad \text{for any } y > 0 \text{ as } t \uparrow x_{+}, \tag{$C^{*}_\gamma$}$$

for some auxiliary function $b$.


In the preceding chapter, we outlined the extreme value approach for tail esti¬ 
mation in case y > 0, that is, when F is of Pareto-type. Now, in the present 
chapter, we discuss statistical tail estimation methods that can serve in all cases, 
whether the extreme value index (EVI) is positive, negative, or zero. The available 
methods can be grouped in three sets: 

• the method of block maxima, inspired by the limit behaviour of the normalized 
maximum of a random sample, 

• the quantile view with methods based on (versions of) $(C_\gamma)$, continuing the
line of approach started with Hill's estimator,

• the probability view, or the peaks over threshold approach (POT) with
methods based on $(C^{*}_\gamma)$. Here, one considers the conditional distribution of the
excesses over relatively high thresholds $t$, interpreting $\bar F(t + y\,b(t))/\bar F(t)$ as
$P\left(\frac{X - t}{b(t)} > y \,\Big|\, X > t\right)$.

Next to these approaches, we also briefly mention the possibilities of
exponential regression models, generalizing the exponential representations of spacings as
considered in section 4.4.


5.1 The Method of Block Maxima 


5.1.1 The basic model 


In Chapter 2, it was proven that the extreme value distributions are the only possible
limiting forms for a normalized maximum of a random sample, at least when a
non-degenerate limit exists. On the basis of this result, the EVI can be estimated
by fitting the generalized extreme value distribution (GEV)

$$G(x; \sigma, \gamma, \mu) = \begin{cases} \exp\left(-\left(1 + \gamma\,\frac{x-\mu}{\sigma}\right)^{-1/\gamma}\right), & 1 + \gamma\,\frac{x-\mu}{\sigma} > 0,\ \gamma \ne 0, \\[4pt] \exp\left(-\exp\left(-\frac{x-\mu}{\sigma}\right)\right), & x \in \mathbb{R},\ \gamma = 0, \end{cases} \tag{5.1}$$

with $\sigma > 0$ and $\mu \in \mathbb{R}$ to maxima of subsamples (Gumbel (1958)). This approach
is popular in the environmental sciences where the GEV is fitted to, for example,
yearly maximal temperatures or yearly maximal river discharges.

5.1.2 Parameter estimation 

For notational convenience, we denote the maximum of a sample $X_1, \ldots, X_n$ by
$Y$. Then when a sample $Y_1, \ldots, Y_m$ of independent sample maxima is available,
the parameters $\sigma$, $\gamma$ and $\mu$ can be estimated in a variety of ways. In Chapter 2, we
already discussed the data-analytic method of selecting the $\gamma$ value that maximizes
the correlation coefficient on the GEV quantile plot followed by a least-squares fit
to obtain estimates for $\mu$ and $\sigma$. In this section, we will focus on the maximum
likelihood (ML) method and the method of (probability weighted) moments.


The ML method 


In case y ^ 0, the log-likelihood function for a sample Y\,... ,Y m of i.i.d. GEV 
random variables is given by 


\ogL(cr, y, = -m\oga -( 丄 + 1) 二 log (1 + 


Yi — P、 


y- 




Yi-t^yy 


(5.2) 
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provided 1 + > 0, / = 1， .. • ， m. When y = 0, the log-likelihood function 

reduces to 

m / V — \ m V-— 

log L(a, 0, /x) = —m log a — Eexp —.(5.3) 

i=l \ ° ’ i=\ ° 

The ML estimator (a, y, /t) for (cr, y, /x) is obtained by maximizing (5.2)-(5.3). For 
computational details, we refer to Prescott and Walden (1980), Prescott and Walden 
(1983), Hosking (1985) and Macleod (1989). Since the support of G depends 
on the unknown parameter values, the usual regularity conditions underlying the 
asymptotic properties of maximum likelihood estimators are not satisfied. This 
problem is studied in depth in Smith (1985). In case $\gamma > -0.5$, the usual properties
of consistency, asymptotic efficiency and asymptotic normality hold. In fact,
for $m \to \infty$,

$$
\sqrt{m}\,\left((\hat\sigma, \hat\gamma, \hat\mu) - (\sigma, \gamma, \mu)\right) \xrightarrow{D} N(0, V_1), \qquad \gamma > -0.5,
$$

where $V_1$ is the inverse of the Fisher information matrix. For more details about
the Fisher information matrix, we refer to the Appendix at the end of this chapter.
In principle, this limit result is valid under the assumption that $Y$ is distributed as
a GEV. Note, however, that the results of Chapter 2 only guarantee that $Y$ is
approximately GEV.
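As an illustration of how (5.2) and (5.3) can be evaluated in practice, the following is a minimal Python sketch using only the standard library. The function name `gev_loglik` and the sample values are hypothetical and serve illustration only; a numerical optimizer of one's choice would still have to be applied to this function to obtain the ML estimates.

```python
import math

def gev_loglik(y, sigma, gamma, mu):
    """Log-likelihood (5.2)-(5.3) of an i.i.d. GEV sample y.

    Returns -inf when sigma <= 0 or when the support constraint
    1 + gamma * (y_i - mu) / sigma > 0 is violated for some observation.
    """
    if sigma <= 0:
        return -math.inf
    total = -len(y) * math.log(sigma)
    for yi in y:
        z = (yi - mu) / sigma
        if gamma == 0.0:
            # Gumbel case (5.3)
            total -= math.exp(-z) + z
        else:
            w = 1.0 + gamma * z
            if w <= 0:
                return -math.inf
            log_w = math.log1p(gamma * z)
            # (1/gamma + 1) * log(w) + w^(-1/gamma), as in (5.2)
            total -= (1.0 / gamma + 1.0) * log_w + math.exp(-log_w / gamma)
    return total

# Made-up block maxima, for illustration only.
maxima = [1.2, 0.4, 2.9, 1.7, 0.8, 3.5]
print(gev_loglik(maxima, sigma=1.0, gamma=0.1, mu=1.0))
```

Returning minus infinity outside the support is one simple way to keep a generic optimizer inside the admissible parameter region; it also makes visible why the usual regularity conditions fail, since the admissible region depends on the data.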

The method of probability-weighted moments 

In general, the probability-weighted moments (PWM) of a random variable $Y$ with
distribution function $F$, introduced by Greenwood et al. (1979), are the quantities

$$
M_{p,r,s} = E\left\{ Y^p\,[F(Y)]^r\,[1 - F(Y)]^s \right\} \tag{5.4}
$$

for real $p$, $r$ and $s$. The specific case of PWM parameter estimation for the GEV
is studied extensively in Hosking et al. (1985). In case $\gamma \neq 0$, setting $p = 1$,
$r = 0, 1, 2, \ldots$ and $s = 0$ yields for the GEV

$$
M_{1,r,0} = \frac{1}{r+1} \left\{ \mu - \frac{\sigma}{\gamma} \left[ 1 - (r+1)^{\gamma}\,\Gamma(1 - \gamma) \right] \right\}, \qquad \gamma < 1. \tag{5.5}
$$

Assume a sample $Y_1, \ldots, Y_m$ of i.i.d. GEV random variables is available. The
PWM estimator $(\hat\sigma, \hat\gamma, \hat\mu)$ for $(\sigma, \gamma, \mu)$ is the solution to the following system of
equations, obtained from (5.5) with $r = 0, 1, 2$,


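$$
M_{1,0,0} = \mu - \frac{\sigma}{\gamma}\left(1 - \Gamma(1-\gamma)\right), \tag{5.6}
$$

$$
2M_{1,1,0} - M_{1,0,0} = \frac{\sigma}{\gamma}\,\Gamma(1-\gamma)\,(2^{\gamma} - 1), \tag{5.7}
$$

$$
\frac{3M_{1,2,0} - M_{1,0,0}}{2M_{1,1,0} - M_{1,0,0}} = \frac{3^{\gamma} - 1}{2^{\gamma} - 1}, \tag{5.8}
$$

after replacing $M_{1,r,0}$ by the unbiased estimator (see Landwehr et al. (1979))

$$
\hat M_{1,r,0} = \frac{1}{m} \sum_{j=1}^{m} \left( \prod_{\ell=1}^{r} \frac{j-\ell}{m-\ell} \right) Y_{j,m}
$$

or by the asymptotically equivalent consistent estimator

$$
\tilde M_{1,r,0} = \frac{1}{m} \sum_{j=1}^{m} \left( \frac{j}{m+1} \right)^{r} Y_{j,m}.
$$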


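Note that to obtain $\hat\gamma$, (5.8) has to be solved numerically. Next, (5.7) can be solved
for $\sigma$, yielding

$$
\hat\sigma = \frac{\hat\gamma\,(2\hat M_{1,1,0} - \hat M_{1,0,0})}{\Gamma(1-\hat\gamma)\,(2^{\hat\gamma} - 1)}.
$$

Finally, given $\hat\gamma$ and $\hat\sigma$, $\hat\mu$ can be obtained from (5.6):

$$
\hat\mu = \hat M_{1,0,0} + \frac{\hat\sigma}{\hat\gamma}\left(1 - \Gamma(1 - \hat\gamma)\right).
$$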
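The three-step solution of (5.6)-(5.8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the bisection bracket $[-5, 0.99]$ for $\gamma$, and the simulated sample are all choices made here and not part of the original text. The unbiased PWM estimator is used for $M_{1,r,0}$.

```python
import math
import random

def pwm_unbiased(y_sorted, r):
    """Unbiased estimate of M_{1,r,0} from ascending order statistics."""
    m = len(y_sorted)
    total = 0.0
    for j, yj in enumerate(y_sorted, start=1):
        w = 1.0
        for l in range(1, r + 1):
            w *= (j - l) / (m - l)
        total += w * yj
    return total / m

def gev_pwm(sample):
    """PWM estimates (sigma, gamma, mu) from the system (5.6)-(5.8)."""
    y = sorted(sample)
    m0, m1, m2 = (pwm_unbiased(y, r) for r in range(3))
    c = (3 * m2 - m0) / (2 * m1 - m0)          # left-hand side of (5.8)

    def ratio(g):                              # right-hand side of (5.8)
        if abs(g) < 1e-12:
            return math.log(3) / math.log(2)   # limit for gamma -> 0
        return math.expm1(g * math.log(3)) / math.expm1(g * math.log(2))

    lo, hi = -5.0, 0.99                        # search bracket (an assumption)
    if not (ratio(lo) < c < ratio(hi)):
        raise ValueError("no solution of (5.8) inside the search bracket")
    for _ in range(200):                       # bisection; ratio is increasing
        mid = 0.5 * (lo + hi)
        if ratio(mid) < c:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    if abs(gamma) < 1e-8:                      # Gumbel limits of (5.7) and (5.6)
        sigma = (2 * m1 - m0) / math.log(2)
        mu = m0 - 0.5772156649015329 * sigma
    else:
        sigma = gamma * (2 * m1 - m0) / (
            math.gamma(1 - gamma) * math.expm1(gamma * math.log(2)))
        mu = m0 + sigma / gamma * (1 - math.gamma(1 - gamma))
    return sigma, gamma, mu

def rgev(n, sigma, gamma, mu, rng):
    """Simulate GEV variates (gamma != 0) by inverting the distribution function."""
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:
            out.append(mu + sigma / gamma * ((-math.log(u)) ** (-gamma) - 1))
    return out

rng = random.Random(1)
sample = rgev(5000, sigma=2.0, gamma=0.1, mu=10.0, rng=rng)
print(gev_pwm(sample))   # estimates of (sigma, gamma, mu)
```

Bisection is chosen because the right-hand side of (5.8) is monotone in $\gamma$, so a bracketed root is found reliably without derivative information.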

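To derive the limiting distribution of $(\hat\sigma, \hat\gamma, \hat\mu)$, we need the limiting behaviour
of $(\hat M_{1,0,0}, \hat M_{1,1,0}, \hat M_{1,2,0})$. Define $M = (M_{1,0,0}, M_{1,1,0}, M_{1,2,0})'$ and
$\hat M = (\hat M_{1,0,0}, \hat M_{1,1,0}, \hat M_{1,2,0})'$. Provided $\gamma < 0.5$, it can be shown that for $m \to \infty$

$$
\sqrt{m}\,(\hat M - M) \xrightarrow{D} N(0, V)
$$

where the elements of $V$ are given by

$$
\begin{aligned}
v_{r,r} &= \left(\frac{\sigma}{\gamma}\right)^{2} (r+1)^{2\gamma} \left\{ \Gamma(1-2\gamma)\,K\!\left(\frac{r}{r+1}\right) - \Gamma^2(1-\gamma) \right\}, \\
v_{r,r+1} &= \frac{1}{2}\left(\frac{\sigma}{\gamma}\right)^{2} \Big\{ (r+2)^{2\gamma}\,\Gamma(1-2\gamma)\,K\!\left(\frac{r}{r+2}\right)
 + (r+1)^{\gamma}\left[(r+1)^{\gamma} - 2(r+2)^{\gamma}\right] \Gamma^2(1-\gamma) \Big\}, \\
v_{r,r+s} &= \frac{1}{2}\left(\frac{\sigma}{\gamma}\right)^{2} \Big\{ (r+s+1)^{2\gamma}\,\Gamma(1-2\gamma)\,K\!\left(\frac{r}{r+s+1}\right)
 - (r+s)^{\gamma}(r+s+1)^{\gamma}\,\Gamma(1-2\gamma)\,K\!\left(\frac{r+1}{r+s}\right) \\
 &\qquad\qquad + 2(r+1)^{\gamma}\left[(r+s)^{\gamma} - (r+s+1)^{\gamma}\right] \Gamma^2(1-\gamma) \Big\}, \qquad s \geq 2,
\end{aligned}
$$

and $K(x) = {}_2F_1(-\gamma, -2\gamma; 1-\gamma; -x)$, with ${}_2F_1$ denoting the hypergeometric
function.

Table 5.1 ML and PWM estimates for the Meuse data.

Method    $\sigma$     $\gamma$     $\mu$
ML        466.468      -0.092       1266.896
PWM       468.358      -0.099       1267.688

Now define $\theta = (\sigma, \gamma, \mu)'$, $\hat\theta = (\hat\sigma, \hat\gamma, \hat\mu)'$ and write the solution to (5.6), (5.7)
and (5.8) as the vector equation $\hat\theta = f(\hat M)$. Further, let $G$ denote the $3 \times 3$ matrix
with generic elements $g_{i,j} = \partial f_i / \partial M_{1,j-1,0}$, $i, j = 1, 2, 3$. Application of the delta
method yields the limiting distribution of $\hat\theta$:

$$
\sqrt{m}\,(\hat\theta - \theta) \xrightarrow{D} N(0, V_2)
$$

where $V_2 = G V G'$, as $m \to \infty$, provided $\gamma < 0.5$.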

Example 5.1 In Table 5.1, we show the ML and PWM estimates for the parameters
$(\sigma, \gamma, \mu)$ obtained from fitting the GEV to the annual maximum discharges of
the Meuse river. Note that the estimates obtained under the two estimation methods
agree quite well. The fit of the GEV to these data can be visually assessed by
inspecting the GEV quantile plot, introduced in Chapter 2. Figure 5.1 shows the
GEV quantile plot obtained with (a) ML and (b) PWM.

We also mention some other estimation methods for the GEV that have been
discussed in the literature: best linear unbiased estimation (Balakrishnan and Chan
(1992)), Bayes estimation (Lye et al. (1993)), method of moments (Christopeit
(1994)), and minimum distance estimation (Dietrich and Hüsler (1996)). In Coles
and Dixon (1999), it is shown that maximum penalized likelihood estimation
improves the small sample properties of a likelihood-based analysis.


5.1.3 Estimation of extreme quantiles 

Estimates of extreme quantiles of the GEV can be obtained by inverting the GEV
distribution function given by (5.1), yielding

$$
q_{Y,p} =
\begin{cases}
\mu + \dfrac{\sigma}{\gamma}\left[(-\log(1-p))^{-\gamma} - 1\right], & \gamma \neq 0, \\
\mu - \sigma \log(-\log(1-p)), & \gamma = 0,
\end{cases}
\tag{5.9}
$$

and replacing $(\sigma, \gamma, \mu)$ by either the ML or probability-weighted moments
estimates. In case $\gamma < 0$, the right endpoint of the GEV is finite and given by

$$
x_+ = \mu - \frac{\sigma}{\gamma}.
$$

The ML estimate of $q_{Y,p}$ can also be obtained directly by a reparametrization
such that $q_{Y,p}$ is one of the model parameters, for instance, substituting
$q_{Y,p} - \frac{\sigma}{\gamma}\left[(-\log(1-p))^{-\gamma} - 1\right]$ for $\mu$. Note that, in case the GEV is used as an
approximation to the distribution of the largest observation in a sample, (5.9) yields the
quantiles of the maximum distribution. Since $F_{X_{n,n}} = F^n \approx G$, one easily obtains
the quantiles of the original $X$ data as

$$
q_{X,p} =
\begin{cases}
\mu + \dfrac{\sigma}{\gamma}\left[(-\log(1-p)^n)^{-\gamma} - 1\right], & \gamma \neq 0, \\
\mu - \sigma \log(-\log(1-p)^n), & \gamma = 0,
\end{cases}
$$

where $n$ is the block length.

Figure 5.1 GEV QQ-plot of the annual maximum discharges of the Meuse river
using (a) ML and (b) PWM estimates (horizontal axis: GEV quantiles).

Figure 5.2 GEV-based quantile estimates for the annual maximum discharges of
the Meuse river; solid line: based on ML, broken line: based on PWM (horizontal axis: $-1/\log(1-p)$).

Example 5.1 (continued) In Figure 5.2, we illustrate the estimation of quantiles 
of the annual maximum discharges of the Meuse river. The solid line (broken 
line) represents quantile estimates based on the ML (PWM) estimates of the GEV 
parameters. 
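The quantile formulas of Section 5.1.3 translate directly into code. The sketch below is a minimal illustration in Python; the function names are hypothetical, and the parameter values in the usage line are simply the ML estimates reported in Table 5.1. Here $p$ is the probability of exceeding the quantile, so that $G(q_{Y,p}) = 1 - p$.

```python
import math

def gev_cdf(x, sigma, gamma, mu):
    """GEV distribution function (5.1)."""
    z = (x - mu) / sigma
    if gamma == 0.0:
        return math.exp(-math.exp(-z))
    w = 1.0 + gamma * z
    if w <= 0:
        # below the left endpoint (gamma > 0) or above the right endpoint (gamma < 0)
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-w ** (-1.0 / gamma))

def gev_quantile(p, sigma, gamma, mu):
    """Quantile q_{Y,p} of the block maximum, formula (5.9)."""
    a = -math.log1p(-p)                 # -log(1 - p)
    if gamma == 0.0:
        return mu - sigma * math.log(a)
    return mu + sigma / gamma * (a ** (-gamma) - 1.0)

def original_quantile(p, n, sigma, gamma, mu):
    """Quantile of the original X data when the GEV describes maxima of blocks of length n."""
    a = -n * math.log1p(-p)             # -log((1 - p)^n)
    if gamma == 0.0:
        return mu - sigma * math.log(a)
    return mu + sigma / gamma * (a ** (-gamma) - 1.0)

# 0.99 quantile of the annual maximum, using the ML estimates of Table 5.1.
print(gev_quantile(0.01, sigma=466.468, gamma=-0.092, mu=1266.896))
```

Using `math.log1p` keeps $-\log(1-p)$ accurate for the very small $p$ that are typical in extreme quantile estimation.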


5.1.4 Inference: confidence intervals 

Confidence intervals and other forms of inference concerning the GEV parameters
$(\sigma, \gamma, \mu)$ follow immediately from the approximate normality of the ML and PWM
estimators. For instance, a $100(1 - \alpha)\%$ confidence interval for the tail index $\gamma$ is
given by

$$
\hat\gamma \pm \Phi^{-1}(1 - \alpha/2)\,\sqrt{\frac{\hat v_{2,2}}{m}}
$$

where $\hat\gamma$ is the ML or PWM estimate of $\gamma$ and $\hat v_{2,2}$ denotes the second diagonal
element of $V_1$, or $V_2$, after replacing the unknown parameters by their estimates.
Similarly, inference concerning the GEV quantiles can be based on the
normal limiting behaviour. Straightforward application of the delta method
yields




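$$
\sqrt{m}\,(\hat q_{Y,p} - q_{Y,p}) \xrightarrow{D} N(0, \kappa' V \kappa) \quad \text{as } m \to \infty
$$

where $\hat q_{Y,p}$ denotes the estimator for $q_{Y,p}$ obtained by plugging the ML or PWM
estimators into (5.9), and where $V$ is $V_1$, or $V_2$, and

$$
\begin{aligned}
\kappa' &= \left( \frac{\partial q_{Y,p}}{\partial \sigma}, \frac{\partial q_{Y,p}}{\partial \gamma}, \frac{\partial q_{Y,p}}{\partial \mu} \right) \\
&= \left( \frac{1}{\gamma}\left[(-\log(1-p))^{-\gamma} - 1\right],\;
 -\frac{\sigma}{\gamma^2}\left[(-\log(1-p))^{-\gamma} - 1\right] - \frac{\sigma}{\gamma}(-\log(1-p))^{-\gamma}\log(-\log(1-p)),\; 1 \right).
\end{aligned}
$$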

Inference based on these normal limit results may be misleading as the normal 
approximation to the true sampling distribution of the respective estimator may 
be rather poor. In general, better approximations can be obtained by the profile 
likelihood function. The profile likelihood function (Barndorff-Nielsen and Cox 
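(1994)) of $\gamma$ is given by

$$
L_p(\gamma) = \max_{\sigma, \mu \mid \gamma} L(\sigma, \gamma, \mu).
$$

Therefore, the profile likelihood ratio statistic

$$
\Lambda = \frac{L_p(\gamma_0)}{L_p(\hat\gamma)}
$$

equals the classical likelihood ratio statistic for testing the hypothesis $H_0: \gamma = \gamma_0$
versus $H_1: \gamma \neq \gamma_0$ and hence, under $H_0$, for $m \to \infty$,

$$
-2 \log \Lambda \xrightarrow{D} \chi^2_1.
$$

The special case of testing $H_0: \gamma = 0$ (the so-called Gumbel hypothesis) is described
in Hosking (1984). Since $H_0$ will be rejected at significance level $\alpha$ if $-2 \log \Lambda >
\chi^2_1(1 - \alpha)$, the profile likelihood-based $100(1 - \alpha)\%$ confidence interval for $\gamma$ is
given by

$$
CI_\gamma = \left\{ \gamma : -2 \log \frac{L_p(\gamma)}{L_p(\hat\gamma)} \leq \chi^2_1(1-\alpha) \right\}
$$

or equivalently

$$
CI_\gamma = \left\{ \gamma : \log L_p(\gamma) \geq \log L_p(\hat\gamma) - \frac{\chi^2_1(1-\alpha)}{2} \right\}.
$$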

Profile likelihood-based confidence intervals for the other GEV parameters can be 
constructed in a similar way. 
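As a small numerical illustration of the second form of the interval, take $\alpha = 0.05$. The only number used is the standard chi-square table value $\chi^2_1(0.95) \approx 3.841$; nothing here is taken from the Meuse analysis.

```latex
\chi^2_1(0.95) \approx 3.841
\quad\Longrightarrow\quad
CI_\gamma = \left\{ \gamma : \log L_p(\gamma) \geq \log L_p(\hat\gamma) - 1.921 \right\}.
```

In other words, the 95% interval collects all values of $\gamma$ whose profile log-likelihood lies within about 1.92 units of its maximum, which is how such an interval is usually read off from a plot of the profile log-likelihood.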

Example 5.1 (continued) The profile likelihood-based 95% confidence intervals
for the EVI and the 0.99 quantile of the annual maximum discharges of the Meuse
river are given in Figure 5.3(a) and (b) respectively. Note that the 95% confidence
interval for $\gamma$ contains the value 0, so at a significance level of 5%, the hypothesis
$H_0: \gamma = 0$ cannot be rejected. Hence, for practical purposes, the annual maximum
discharges can be adequately modelled by the Gumbel distribution.

Figure 5.3 Profile log-likelihood function and profile likelihood-based 95%
confidence intervals for (a) $\gamma$ and (b) $q_{0.99}$.

A major weakness of the GEV approach is that it uses only the maximum of each
block, so that many data are wasted. Another problem is the determination of
an appropriate block size $n$, especially in case of time series data where the time
dependence is to be thinned out by using appropriate independent blocks from
which one extracts one maximum; this will be a topic of interest in Chapter 10 on
extreme value methods in time series analysis. To overcome the first problem, threshold
methods and methods based on the $k$ largest order statistics have been developed.
These are reviewed now.


5.2 Quantile View: Methods Based on ($C_\gamma$)

Several estimators based on extreme order statistics are available in order to
estimate a real-valued EVI, and correspondingly large quantiles and small tail
probabilities. These methods rely mainly on the conditions ($C_\gamma$) and ($\tilde C_\gamma$). We discuss
here three methods: the estimator proposed by Pickands (1975) and its generalizations,
the moment estimator from Dekkers et al. (1989) and the estimators based
on the generalized quantile plot proposed in Beirlant et al. (1996c) and Beirlant
et al. (2002b).


5.2.1 Pickands estimator 

From ($C_\gamma$), we obtain

$$
\frac{1}{\log 2} \log\left\{ \frac{U(4y) - U(2y)}{U(2y) - U(y)} \right\}
= \frac{1}{\log 2} \log\left\{ \frac{U(4y) - U(2y)}{a(2y)} \cdot \frac{a(2y)}{a(y)} \cdot \frac{a(y)}{U(2y) - U(y)} \right\}
\to \frac{1}{\log 2} \log 2^{\gamma} = \gamma, \qquad y \to \infty.
$$

Treating the limit as an approximate equality for large $y = (n+1)/k$ and replacing
$U(x)$ by its empirical version $\hat U_n(x) = X_{n - \lceil (n+1)/x \rceil + 1, n}$ leads to the Pickands (1975)
estimator for the EVI

$$
\hat\gamma_{P,k} = \frac{1}{\log 2} \log\left( \frac{X_{n - \lceil k/4 \rceil + 1, n} - X_{n - \lceil k/2 \rceil + 1, n}}{X_{n - \lceil k/2 \rceil + 1, n} - X_{n-k+1,n}} \right)
$$

for $k = 1, \ldots, n$. Pickands' original definition uses $4k$ rather than $k$.

The great simplicity of the Pickands estimator $\hat\gamma_{P,k}$ is quite appealing but
unfortunately offset by its rather large asymptotic variance, equal to
$\gamma^2 (2^{2\gamma+1} + 1)\{(2^{\gamma} - 1)\log 2\}^{-2}$ (Dekkers and de Haan 1989), and its large volatility as a
function of $k$. This motivated the quest for more efficient variants.
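A minimal Python sketch of the Pickands estimator follows. It uses the ceiling-based indices $\lceil k/4 \rceil$ and $\lceil k/2 \rceil$; the function name and the artificial test sample (exact quantiles of a strict Pareto distribution with $\gamma = 0.5$) are illustrative assumptions.

```python
import math

def pickands(sample, k):
    """Pickands estimator of the EVI based on the k largest observations.

    Requires ceil(k/4) < ceil(k/2) < k <= n so that both spacings are positive.
    """
    x = sorted(sample)                  # x[i - 1] is the order statistic X_{i,n}
    n = len(x)
    k4, k2 = math.ceil(k / 4), math.ceil(k / 2)
    top = x[n - k4] - x[n - k2]         # X_{n-ceil(k/4)+1,n} - X_{n-ceil(k/2)+1,n}
    bottom = x[n - k2] - x[n - k]       # X_{n-ceil(k/2)+1,n} - X_{n-k+1,n}
    return math.log(top / bottom) / math.log(2)

# Artificial sample: exact strict-Pareto quantiles with gamma = 0.5.
n = 2000
sample = [((n + 1) / (n + 1 - i)) ** 0.5 for i in range(1, n + 1)]
print(pickands(sample, 400))
```

Because only three order statistics enter the estimate, small changes in $k$ can change the value substantially, which is the volatility mentioned above.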
A recent proposal by Segers (2004) is the estimator

$$
\hat\gamma_k(c, \lambda) = \int_0^1 \log\left( X_{n - \lfloor c \lceil tk \rceil \rfloor, n} - X_{n - \lceil tk \rceil, n} \right) d\lambda(t)
= \sum_{j=1}^{k} \left\{ \lambda\left(\frac{j}{k}\right) - \lambda\left(\frac{j-1}{k}\right) \right\} \log\left( X_{n - \lfloor cj \rfloor, n} - X_{n-j,n} \right). \tag{5.10}
$$

Here, $0 < c < 1$ while $\lambda$ is a right-continuous function on $[0, 1]$ such that
$\lambda(0) = \lambda(1) = 0$ and $\int_0^1 \lambda(t)\,t^{-1}\,dt = 1$. The simplest example is

$$
\lambda_v(t) =
\begin{cases}
0 & \text{if } 0 \leq t < v, \\
1/\log(1/v) & \text{if } v \leq t < 1, \\
0 & \text{if } t = 1,
\end{cases}
$$

for some $0 < v < 1$. The estimator $\hat\gamma_k(c, \lambda_v)$ is in fact the one proposed by Yun
(2002), including as special cases the ones by Pickands (1975) [$c = v = 1/2$],
Pereira (1994) and Fraga Alves (1995) [$c = v$], and Yun (2000b) [$1/4 < c < 1$
and $v = (4c)^{-1}$]. A more general example is

$$
\lambda(t) = \{\mu(t/v) - \mu(t)\}/\log(1/v), \qquad 0 \leq t \leq 1,
$$

where again $0 < v < 1$ and where $\mu$ is the distribution function of a probability
measure concentrated on $(0, 1]$. The estimator $\hat\gamma_k(c, \lambda)$ can be regarded as a
mixture of the estimator $\hat\gamma_k(c, \lambda_v)$ over different values of $k$, encompassing thereby
the estimators of Drees (1995) and Falk (1994).

Segers (2004) establishes asymptotic normality of $\hat\gamma_k(c, \lambda)$ under the conditions
of Theorem 3.1. For fixed $0 < c < 1$ and $\gamma \neq -1/2$, the limiting asymptotic
variance $\sigma^2(\gamma, c, \lambda)$ is minimal for $\lambda$ equal to $\lambda_{\delta,c}$, where $\delta = |\gamma + 1/2| - 1/2$ and,
for $t \in [c^j, c^{j-1})$ (positive integer $j$),

$$
\lambda_{\delta,c}(t) =
\begin{cases}
(1 - c^{1+\delta})\,\dfrac{1 - c^{\delta j}}{1 - c^{\delta}}\,t, & \text{if } \delta \neq 0, \\
(1 - c^{1+\delta})\,j\,t, & \text{if } \delta = 0.
\end{cases}
$$

In this case,

$$
\sigma^2(\gamma, c) := \sigma^2(\gamma, c, \lambda_{\delta,c}) =
\begin{cases}
\dfrac{\gamma^2 (1 - c^{1+\gamma})^2}{c\,(1 - c^{\gamma})^2}, & \text{for } \gamma > -1/2 \text{ and } \gamma \neq 0, \\
\dfrac{(1 - c)^2}{c\,(\log c)^2}, & \text{for } \gamma = 0, \\
\ldots & \text{for } \gamma < -1/2.
\end{cases}
$$

The optimal choice of $c$ is $c \to 1$, in which case

$$
\sigma^2(\gamma) = \lim_{c \uparrow 1} \sigma^2(\gamma, c) =
\begin{cases}
(1 + \gamma)^2, & \text{for } \gamma > -1/2, \\
\ldots & \text{for } \gamma < -1/2.
\end{cases}
$$

Clearly, choosing $c = 1$ in (5.10) does not lead to an admissible estimator. This
poses no problem in practice, however, as the choice $c = 0.75$ already leads
to a relative efficiency of 96%. Observe that for $\gamma > -1/2$, the limiting
variance $\sigma^2(\gamma) = (1 + \gamma)^2$ is that of the ML estimator for $\gamma$ in the GP model
(Smith 1987).

The optimal choice for $\lambda$ depends on $\gamma$, which is unknown. The solution is to
define $\hat\gamma = \hat\gamma_k(c, \lambda_{\tilde\delta,c})$, where $\tilde\delta = |\tilde\gamma + 1/2| - 1/2$ and $\tilde\gamma$ is an arbitrary consistent
estimator of $\gamma$ based on the $X_{n-k+i,n}$, $i = 1, \ldots, k$, for instance, $\tilde\gamma = \hat\gamma_k(c, \lambda_{0,c})$. The
asymptotic variance of this two-stage procedure is the same as when we would
use $\gamma$ rather than $\tilde\gamma$ (Segers 2004). The estimator is illustrated for the SOA data
in Figure 5.5.


5.2.2 The moment estimator 

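The moment estimator has been introduced by Dekkers et al. (1989) as a direct
generalization of the Hill estimator:

$$
M_{k,n} = H_{k,n} + 1 - \frac{1}{2}\left( 1 - \frac{H_{k,n}^2}{H_{k,n}^{(2)}} \right)^{-1},
$$

where

$$
H_{k,n}^{(2)} = \frac{1}{k} \sum_{j=1}^{k} \left( \log X_{n-j+1,n} - \log X_{n-k,n} \right)^2.
$$

To understand this estimator, we can proceed as follows: for any $j \in \{1, \ldots, k\}$,
we have that

$$
\log X_{n-j+1,n} - \log X_{n-k,n} = \log \hat U_n\left(\frac{n+1}{j}\right) - \log \hat U_n\left(\frac{n+1}{k+1}\right),
$$

and hence $\log X_{n-j+1,n} - \log X_{n-k,n}$ can be seen as an estimate of

$$
\log U\left(\frac{n+1}{j}\right) - \log U\left(\frac{n+1}{k+1}\right)
= \log U\left(\frac{n+1}{k+1} \cdot \frac{k+1}{j}\right) - \log U\left(\frac{n+1}{k+1}\right).
$$

Now, choosing $x = \frac{n+1}{k+1}$ and $u = \frac{k+1}{j}$ in ($\tilde C_\gamma$), then for any $j \in \{1, \ldots, k\}$, as
$n/k \to \infty$,

$$
\log X_{n-j+1,n} - \log X_{n-k,n} \approx
\begin{cases}
\dfrac{a\left(\frac{n+1}{k+1}\right)}{U\left(\frac{n+1}{k+1}\right)} \log\dfrac{k+1}{j}, & \text{if } \gamma \geq 0, \\[2ex]
\dfrac{a\left(\frac{n+1}{k+1}\right)}{U\left(\frac{n+1}{k+1}\right)} \cdot \dfrac{\left(\frac{k+1}{j}\right)^{\gamma} - 1}{\gamma}, & \text{if } \gamma < 0.
\end{cases}
$$

For $k \to \infty$, we have the following limiting results

$$
\begin{aligned}
\frac{1}{k}\sum_{j=1}^{k} \log\frac{k+1}{j} &\to -\int_0^1 \log u \, du = 1, \\
\frac{1}{k}\sum_{j=1}^{k} \left( \log\frac{k+1}{j} \right)^2 &\to \int_0^1 (\log u)^2 \, du = 2, \\
\frac{1}{k}\sum_{j=1}^{k} \left( \left(\frac{j}{k+1}\right)^{-\gamma} - 1 \right) &\to \int_0^1 (u^{-\gamma} - 1) \, du = \frac{\gamma}{1-\gamma} \quad (\gamma < 0), \\
\frac{1}{k}\sum_{j=1}^{k} \left( \left(\frac{j}{k+1}\right)^{-\gamma} - 1 \right)^2 &\to \int_0^1 (u^{-\gamma} - 1)^2 \, du = \frac{2\gamma^2}{(1-\gamma)(1-2\gamma)} \quad (\gamma < 0).
\end{aligned}
$$

We see therefore that as $k, n \to \infty$ and $k/n \to 0$,

$$
\frac{H_{k,n}^2}{H_{k,n}^{(2)}} \xrightarrow{P}
\begin{cases}
\dfrac{1}{2}, & \text{if } \gamma \geq 0, \\[1ex]
\dfrac{1 - 2\gamma}{2(1-\gamma)}, & \text{if } \gamma < 0.
\end{cases}
$$

The consistency of the moment estimator now follows since

$$
H_{k,n} \xrightarrow{P}
\begin{cases}
\gamma, & \text{if } \gamma \geq 0, \\
0, & \text{if } \gamma < 0,
\end{cases}
$$

since in the non Pareto-type case where $\gamma < 0$, the slope of the Pareto quantile
plot will tend to zero near the higher observations.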
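The moment estimator is simple to compute once the log-spacings with respect to $X_{n-k,n}$ are available. The following Python sketch is illustrative only; the function name and the artificial strict-Pareto sample with $\gamma = 0.5$ are assumptions made here.

```python
import math

def hill_and_moment(sample, k):
    """Return (H_{k,n}, M_{k,n}) based on the k largest observations of a positive sample."""
    x = sorted(sample)
    n = len(x)
    log_ref = math.log(x[n - k - 1])                  # log X_{n-k,n}
    # log X_{n-j+1,n} - log X_{n-k,n}, j = 1..k
    d = [math.log(x[n - j]) - log_ref for j in range(1, k + 1)]
    h1 = sum(d) / k                                   # Hill estimator H_{k,n}
    h2 = sum(v * v for v in d) / k                    # second empirical moment H^{(2)}_{k,n}
    moment = h1 + 1.0 - 0.5 / (1.0 - h1 * h1 / h2)
    return h1, moment

# Artificial sample: exact strict-Pareto quantiles with gamma = 0.5.
n = 5000
sample = [((n + 1) / (n + 1 - i)) ** 0.5 for i in range(1, n + 1)]
print(hill_and_moment(sample, 1000))
```

For Pareto-type data both returned values should be close to $\gamma$; for $\gamma < 0$ the Hill part tends to zero and the correction term carries the estimate, in line with the limits derived above.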


5.2.3 Estimators based on the generalized quantile plot 


Following (2.15), the function $U(x)\,e_{\log X}(\log U(x))$ is regularly varying with index
$\gamma$ since indeed also $a$ is a regularly varying function. Therefore,

$$
UH(x) := U(x)\,e_{\log X}(\log U(x)) = x^{\gamma}\,\ell_{UH}(x),
$$

for some slowly varying function $\ell_{UH}$. Hence, as in case of the Pareto quantile
plot, when $x \to \infty$

$$
\frac{\log\left( U(x)\,e_{\log X}(\log U(x)) \right)}{\log x} \to \gamma,
$$

that is, when plotting $\log\left(U(x)\,e_{\log X}(\log U(x))\right)$ versus $\log x$, we obtain an ultimately
linear graph with slope $\gamma$. In practice, we replace $x$ by $\frac{n+1}{j+1}$ and we estimate
$e_{\log X}(\log U(x))$ with the Hill estimator $H_{j,n}$. We obtain that the plot

$$
\left( \log\frac{n+1}{j+1},\ \log\left( X_{n-j,n}\,H_{j,n} \right) \right), \qquad j = 1, \ldots, n-1, \tag{5.11}
$$

will be ultimately linear with slope $\gamma$.

Example 5.2 In Figure 5.4, this is illustrated for the wind-speed data from three
cities in the United States, introduced in Chapter 1. These data are the daily fastest-mile
speeds measured by anemometers 10 m above the ground. The line structures
in the generalized quantile plots are the result of an inherent grouping of the data
due to loss of accuracy during the data-collecting process. For the Des Moines daily
wind-speed maxima ($n = 5478$), the generalized quantile plot (5.11) clearly shows
an increasing behaviour, which reflects a heavy tail for the underlying distribution.
The flattening trend in the Grand Rapids dataset ($n = 5478$) suggests a weaker tail
with $\gamma = 0$, while for Albuquerque ($n = 6939$) even a negative $\gamma$-value, resulting
in a distribution with a finite right endpoint, can be expected.

As in the previous chapter, one can now establish an estimation procedure
analogous to that induced by the Hill estimator. The slope in the generalized
quantile plot is then estimated by

$$
\hat\gamma^{H}_{k,n} = \frac{1}{k} \sum_{j=1}^{k} \log UH_{j,n} - \log UH_{k+1,n},
$$

where $UH_{j,n} := X_{n-j,n}\,H_{j,n}$.

Next to the above-discussed approach based on Hill-type operations on the UH
statistics, the slope of the ultimate linear part of the generalized quantile plot can
also be estimated by an unconstrained least-squares fit to the $k$ 'last' points on the
generalized quantile plot, as proposed by Beirlant et al. (2002b). Minimizing

$$
\sum_{j=1}^{k} \left( \log UH_{j,n} - \delta - \gamma \log\frac{k+1}{j} \right)^2
$$

with respect to $\delta$ and $\gamma$ results in the so-called Zipf estimator:

$$
\hat\gamma^{Z}_{k,n} = \frac{\frac{1}{k}\sum_{j=1}^{k} \left( \log\frac{k+1}{j} - \frac{1}{k}\sum_{i=1}^{k} \log\frac{k+1}{i} \right) \log UH_{j,n}}
{\frac{1}{k}\sum_{j=1}^{k} \log^2\frac{k+1}{j} - \left( \frac{1}{k}\sum_{j=1}^{k} \log\frac{k+1}{j} \right)^2}.
$$

An interesting property of this estimator is the smoothness of the realizations as a
function of $k$, which alleviates the problem of choosing $k$ to some extent.
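The UH statistics and the two slope estimators built on them can be sketched as follows. This Python sketch rests on explicit assumptions: the function names are hypothetical, the data are taken to be positive and free of ties in the upper tail, and the least-squares fit regresses $\log UH_{j,n}$ on $\log((k+1)/j)$, $j = 1, \ldots, k$.

```python
import math

def uh_statistics(sample, kmax):
    """UH_{j,n} = X_{n-j,n} * H_{j,n} for j = 1..kmax (needs kmax < n)."""
    x = sorted(sample)
    n = len(x)
    logs = [math.log(v) for v in x]
    uh, running = [], 0.0
    for j in range(1, kmax + 1):
        running += logs[n - j]                    # sum of log X_{n-i+1,n}, i = 1..j
        hill_j = running / j - logs[n - j - 1]    # Hill estimator H_{j,n}
        uh.append(x[n - j - 1] * hill_j)
    return uh

def generalized_hill(sample, k):
    """Hill-type slope estimate on the generalized quantile plot."""
    uh = uh_statistics(sample, k + 1)
    return sum(math.log(u) for u in uh[:k]) / k - math.log(uh[k])

def zipf(sample, k):
    """Least-squares slope of log UH_{j,n} against log((k+1)/j), j = 1..k."""
    uh = uh_statistics(sample, k)
    reg = [math.log((k + 1) / j) for j in range(1, k + 1)]
    mean_reg = sum(reg) / k
    num = sum((r - mean_reg) * math.log(u) for r, u in zip(reg, uh)) / k
    den = sum(r * r for r in reg) / k - mean_reg ** 2
    return num / den

# Artificial sample: exact strict-Pareto quantiles with gamma = 0.5.
n = 5000
sample = [((n + 1) / (n + 1 - i)) ** 0.5 for i in range(1, n + 1)]
print(generalized_hill(sample, 1000), zipf(sample, 1000))
```

Computing the Hill estimates through a running sum keeps the cost linear in `kmax`, which matters when the estimators are plotted for every $k$.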



Example 5.3 We illustrate the above-introduced quantile-based estimators on the
SOA Group Medical Insurance claim data. In Figure 5.5, we plot $\hat\gamma_{P,k}$ (solid line),
$M_{k,n}$ (broken line), $\hat\gamma^{H}_{k,n}$ (broken-dotted line) and $\hat\gamma^{Z}_{k,n}$ (dotted line) as a function of
$k$. The moment, generalized Hill and Zipf estimators are quite stable when plotted
as a function of $k$ and indicate a $\gamma$ value of around 0.35, a result that is consistent
with the estimates obtained in Chapter 4. The Pickands estimator also indicates a
$\gamma$ estimate of around 0.3 to 0.4 but, compared to the other estimators, shows a
much larger variability.


5.3 Tail Probability View: Peaks-Over-Threshold Method


5.3.1 The basic model 


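The left-hand side in the ($C^*_\gamma$) condition can be interpreted as the conditional
survival function of the exceedances (or peaks, or excesses) $Y = X - t$ over a
threshold $t$, taken at $y\,b(t) > 0$:

$$
\bar F_t(y\,b(t)) := P(Y > y\,b(t) \mid Y > 0) = P\left( \frac{X - t}{b(t)} > y \,\Big|\, X > t \right) = \frac{\bar F(t + y\,b(t))}{\bar F(t)}.
$$

Hence, from ($C^*_\gamma$), it appears a natural statistical procedure to approximate the
distribution $F_t$ by the distribution given by the right-hand side in ($C^*_\gamma$):

$$
\bar F_t(y) \approx \left( 1 + \frac{\gamma y}{b(t)} \right)^{-1/\gamma}. \tag{5.12}
$$

Interpreting $b(t)$ in this last expression as a scale parameter $\sigma$, we are led to fit
the GP distribution, $H$, specified by

$$
H(y) =
\begin{cases}
1 - \left(1 + \dfrac{\gamma y}{\sigma}\right)^{-1/\gamma}, & y \in (0, \infty) \text{ if } \gamma > 0, \\[1ex]
1 - \exp\left(-\dfrac{y}{\sigma}\right), & y \in (0, \infty) \text{ if } \gamma = 0, \\[1ex]
1 - \left(1 + \dfrac{\gamma y}{\sigma}\right)^{-1/\gamma}, & y \in \left(0, -\dfrac{\sigma}{\gamma}\right) \text{ if } \gamma < 0,
\end{cases}
\tag{5.13}
$$

to the exceedances over a sufficiently high threshold.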

The use of the GP distribution as an approximate model for exceedances over high
thresholds can also be motivated on the basis of a point process characterization of
high-level exceedances. For more details about point processes, we refer the reader
to section 5.9.2. Let $X_1, \ldots, X_n$ be independent random variables with common
distribution function $F$, where $F$ satisfies ($C_\gamma$), and consider the two-dimensional
point process

$$
P_n = \left\{ \left( \frac{i}{n+1}, \frac{X_i - b_n}{a_n} \right);\ i = 1, \ldots, n \right\},
$$

where $a_n$ and $b_n$ normalize $X_{n,n}$ appropriately, as discussed in Chapter 2. It can
be shown that on sets that exclude the lower boundary, $P_n$ converges weakly to a
two-dimensional Poisson process. The intensity measure $\Lambda$ of the limiting Poisson
process can be immediately derived from the Poisson property. Indeed, since

$$
\lim_{n \to \infty} P(\text{no points in } (0, 1) \times (x, \infty)) = \lim_{n \to \infty} P\left( \frac{X_{n,n} - b_n}{a_n} \leq x \right)
= \exp\left( -(1 + \gamma x)^{-1/\gamma} \right),
$$

we have that for sets $A = (t_1, t_2) \times (x, \infty)$, $t_1 < t_2$,

$$
\Lambda(A) = (t_2 - t_1)\,(1 + \gamma x)^{-1/\gamma}. \tag{5.14}
$$

In Figure 5.6, a graphical illustration is provided for this point process
interpretation.

Figure 5.6 Illustration of the point process characterization of high-level
exceedances.

Now, for a sufficiently large $u$,

$$
P\left( \frac{X_i - b_n}{a_n} > u + x \,\Big|\, \frac{X_i - b_n}{a_n} > u \right) \approx \frac{\Lambda((0, 1) \times (u + x, \infty))}{\Lambda((0, 1) \times (u, \infty))}
= \left( 1 + \frac{\gamma x}{1 + \gamma u} \right)^{-1/\gamma},
$$

which is the GP survival function with scale $\sigma(u) = 1 + \gamma u$. For practical
purposes, the unknown normalizing constants $a_n$ and $b_n$ can be absorbed in the
GEV distribution. So above high thresholds, $P_n$ can be approximated by a
two-dimensional Poisson process with intensity measure

$$
\Lambda(A) = (t_2 - t_1) \left( 1 + \gamma\,\frac{x - \mu}{\sigma} \right)^{-1/\gamma}.
$$

Similarly,

$$
\bar F_u(x) \approx \left( 1 + \frac{\gamma x}{\sigma + \gamma (u - \mu)} \right)^{-1/\gamma}.
$$

For a detailed mathematical derivation of these point process results, we refer the
interested reader to Leadbetter et al. (1983), Falk et al. (1994) and Embrechts et al.
(1997).


5.3.2 Parameter estimation 


Given a value of the threshold $t$ and the number of data $N_t$ from the original
sample $X_1, \ldots, X_n$ exceeding $t$, the estimation of the parameters $\gamma$ and $\sigma$ can
be performed in a variety of ways. We mention the ML method, the method of
(probability-weighted) moments and the elemental percentile method (EPM). We
denote the absolute exceedances by $Y_j = X_i - t$, provided $X_i > t$, $j = 1, \ldots, N_t$,
where $i$ is the index of the $j$-th exceedance in the original sample. Often, the
threshold is taken at one of the sample points, that is, $t = X_{n-k,n}$. In this case, the
ordered exceedances are given by $Y_{j,k} = X_{n-k+j,n} - X_{n-k,n}$, $j = 1, \ldots, k$.


The ML method 

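The log-likelihood function for a sample $Y_1, \ldots, Y_{N_t}$ of i.i.d. GP random variables
is given by

$$
\log L(\sigma, \gamma) = -N_t \log \sigma - \left( \frac{1}{\gamma} + 1 \right) \sum_{i=1}^{N_t} \log\left( 1 + \frac{\gamma Y_i}{\sigma} \right)
$$

provided $1 + \gamma Y_i/\sigma > 0$, $i = 1, \ldots, N_t$. If $\gamma = 0$, the exponential distribution-based
log-likelihood function given by

$$
\log L(\sigma, 0) = -N_t \log \sigma - \frac{1}{\sigma} \sum_{i=1}^{N_t} Y_i
$$

has to be used. The maximization of $\log L(\sigma, \gamma)$ can be best performed using a
reparametrization

$$
(\sigma, \gamma) \to (\tau, \gamma) \quad \text{with } \tau = \frac{\gamma}{\sigma},
$$

yielding

$$
\log L(\tau, \gamma) = -N_t \log \gamma + N_t \log \tau - \left( \frac{1}{\gamma} + 1 \right) \sum_{i=1}^{N_t} \log(1 + \tau Y_i).
$$

The ML estimators $\hat\tau$ and $\hat\gamma$ then follow from

$$
\frac{1}{\hat\tau} - \left( \frac{1}{\hat\gamma} + 1 \right) \frac{1}{N_t} \sum_{i=1}^{N_t} \frac{Y_i}{1 + \hat\tau Y_i} = 0,
$$

where

$$
\hat\gamma = \frac{1}{N_t} \sum_{i=1}^{N_t} \log(1 + \hat\tau Y_i).
$$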
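The reparametrization $\tau = \gamma/\sigma$ reduces the GP likelihood maximization to a one-dimensional search, because for fixed $\tau$ the maximizing $\gamma$ is the average of $\log(1 + \tau Y_i)$. The Python sketch below illustrates this for the case $\gamma > 0$ only. The function name, the search range for $\tau$ (set relative to the sample mean), and the grid-plus-golden-section search are all assumptions of this sketch, not a method prescribed by the text.

```python
import math

def gp_ml(excesses, grid=200):
    """ML fit of the GP distribution via tau = gamma / sigma (gamma > 0 branch only).

    Returns (sigma_hat, gamma_hat).
    """
    y = list(excesses)
    n = len(y)
    ybar = sum(y) / n

    def gamma_of(tau):
        # gamma maximizing log L(tau, gamma) for fixed tau
        return sum(math.log1p(tau * v) for v in y) / n

    def profile(log_tau):
        # log L(tau, gamma_of(tau)) / N_t, simplified using the definition of gamma_of
        tau = math.exp(log_tau)
        g = gamma_of(tau)
        return math.log(tau) - math.log(g) - g - 1.0

    # Coarse grid on log(tau), then golden-section refinement near the best grid point.
    lo, hi = math.log(1e-3 / ybar), math.log(1e3 / ybar)
    pts = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    best = max(range(grid), key=lambda i: profile(pts[i]))
    a, b = pts[max(best - 1, 0)], pts[min(best + 1, grid - 1)]
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(60):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if profile(c) < profile(d):
            a = c
        else:
            b = d
    tau = math.exp(0.5 * (a + b))
    gamma = gamma_of(tau)
    return gamma / tau, gamma

# Artificial excesses: exact GP quantiles with sigma = 2 and gamma = 0.3.
n = 2000
ys = [2.0 / 0.3 * ((1 - (j - 0.5) / n) ** -0.3 - 1) for j in range(1, n + 1)]
print(gp_ml(ys))
```

A coarse grid is used before refining because the profile in $\tau$ can be very flat towards $\tau \to 0$ (the exponential limit), where a purely local search started badly could stall.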


The method of probability-weighted moments 

The method of moments (MOM) and the method of probability-weighted moments
(PWM) estimators for the GP distribution were introduced by Hosking and Wallis
(1987). Both methods share the basic idea that estimators for unknown parameters
can be derived from the expressions for the population moments. The $r$-th moment
of the GP distribution exists if $\gamma < 1/r$. Provided that they exist, the mean and
the variance of the GP distribution are given by respectively

$$
E(Y) = \frac{\sigma}{1 - \gamma}, \tag{5.15}
$$

$$
\mathrm{var}(Y) = \frac{\sigma^2}{(1 - \gamma)^2 (1 - 2\gamma)}. \tag{5.16}
$$

Assume a sample $Y_1, \ldots, Y_{N_t}$ of i.i.d. GP random variables is available. The order
statistics associated with $Y_1, \ldots, Y_{N_t}$ are denoted by $Y_{1,N_t} \leq \cdots \leq Y_{N_t,N_t}$. Replacing
$E(Y)$ by $\bar Y = \sum_{i=1}^{N_t} Y_i/N_t$ and $\mathrm{var}(Y)$ by $S_Y^2 = \sum_{i=1}^{N_t} (Y_i - \bar Y)^2/(N_t - 1)$ and
solving (5.15)-(5.16) for $\gamma$ and $\sigma$ yields the MOM estimators:

$$
\hat\gamma_{MOM} = \frac{1}{2}\left( 1 - \frac{\bar Y^2}{S_Y^2} \right), \qquad
\hat\sigma_{MOM} = \frac{\bar Y}{2}\left( 1 + \frac{\bar Y^2}{S_Y^2} \right).
$$

We now turn to PWM estimation of the GP parameters $\sigma$ and $\gamma$. In case
of the GP distribution, it is convenient to consider (5.4) with $p = 1$, $r = 0$ and
$s = 0, 1, 2, \ldots$ leading to

$$
M_{1,0,s} = \frac{\sigma}{(s+1)(s+1-\gamma)}, \qquad \gamma < 1. \tag{5.17}
$$

Replacing $M_{1,0,s}$ by its empirical counterpart (as in case of fitting a GEV
distribution)

$$
\hat M_{1,0,s} = \frac{1}{N_t} \sum_{j=1}^{N_t} \left( \prod_{\ell=1}^{s} \frac{N_t - j - \ell + 1}{N_t - \ell} \right) Y_{j,N_t}
$$

or

$$
\tilde M_{1,0,s} = \frac{1}{N_t} \sum_{j=1}^{N_t} \left( 1 - \frac{j}{N_t + 1} \right)^{s} Y_{j,N_t},
$$

and solving (5.17) for $s = 0$ and $s = 1$ with respect to $\gamma$ and $\sigma$ yields the PWM
estimators

$$
\hat\gamma_{PWM} = 2 - \frac{\hat M_{1,0,0}}{\hat M_{1,0,0} - 2\hat M_{1,0,1}}, \qquad
\hat\sigma_{PWM} = \frac{2\hat M_{1,0,0}\,\hat M_{1,0,1}}{\hat M_{1,0,0} - 2\hat M_{1,0,1}}.
$$

Note that the PWM estimator for $\gamma$ can be written as a ratio of weighted sums of
ordered exceedances. In case $\tilde M_{1,0,s}$ is used as an estimator for $M_{1,0,s}$, this
yields

$$
\hat\gamma_{PWM} = \frac{\frac{1}{N_t}\sum_{j=1}^{N_t} \left( 4\,\frac{j}{N_t+1} - 3 \right) Y_{j,N_t}}{\frac{1}{N_t}\sum_{j=1}^{N_t} \left( 2\,\frac{j}{N_t+1} - 1 \right) Y_{j,N_t}}.
$$

Application of the MOM and PWM estimators is not without problems. First,
in case $\gamma \geq 1$, the MOM and PWM estimators do not exist. Second, the obtained
estimates may be inconsistent with the observed data in the sense that in case
$\gamma < 0$, some of the observations may fall above the estimate of the right endpoint.
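Both moment-type estimators for the GP distribution have closed forms, so a sketch is short. The Python functions below are illustrative; the names are hypothetical, the PWM version uses the plotting-position form with weights $1 - j/(N_t+1)$, and the artificial excesses are exact GP quantiles with $\sigma = 1$ and $\gamma = 0.1$.

```python
def gp_mom(excesses):
    """Method-of-moments estimates (sigma, gamma) for the GP distribution."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((v - mean) ** 2 for v in excesses) / (n - 1)
    r = mean * mean / var
    return 0.5 * mean * (1.0 + r), 0.5 * (1.0 - r)

def gp_pwm(excesses):
    """PWM estimates (sigma, gamma) based on M_{1,0,0} and M_{1,0,1}."""
    y = sorted(excesses)
    n = len(y)
    m0 = sum(y) / n
    # plotting-position estimate of M_{1,0,1}: weights 1 - j/(n+1)
    m1 = sum((1.0 - j / (n + 1)) * yj for j, yj in enumerate(y, start=1)) / n
    denom = m0 - 2.0 * m1
    return 2.0 * m0 * m1 / denom, 2.0 - m0 / denom

# Artificial excesses: exact GP quantiles with sigma = 1 and gamma = 0.1.
n = 5000
ys = [1.0 / 0.1 * ((1 - j / (n + 1)) ** -0.1 - 1) for j in range(1, n + 1)]
print(gp_mom(ys), gp_pwm(ys))
```

A simple check of the second problem mentioned above is to compare the largest excess with the estimated endpoint $-\hat\sigma/\hat\gamma$ whenever $\hat\gamma < 0$.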


The elemental percentile method 

The elemental percentile method (EPM) introduced by Castillo and Hadi (1997)
overcomes some of the difficulties associated with the ML method and the method
of (probability-weighted) moments. In fact, for this method, there are no restrictions
on the value of $\gamma$. Here, we will concentrate on the estimation of $\gamma \neq 0$. In case
$\gamma = 0$, the parameter $\sigma$ can be estimated efficiently with the ML method. Assume a
sample $Y_1, \ldots, Y_{N_t}$ of i.i.d. GP random variables is available. Consider two distinct
order statistics $Y_{i,N_t}$ and $Y_{j,N_t}$. Equating the GP cumulative distribution function
evaluated at these order statistics to the corresponding percentile values gives a
system of two equations in two unknowns:

$$
1 - (1 + \tau_{i,j} Y_{i,N_t})^{-1/\gamma_{i,j}} = p_{i,n}, \tag{5.18}
$$

$$
1 - (1 + \tau_{i,j} Y_{j,N_t})^{-1/\gamma_{i,j}} = p_{j,n}, \tag{5.19}
$$

where, as before, $\tau = \gamma/\sigma$ and $p_{i,n}$ is the percentile value associated with $Y_{i,N_t}$.
Elimination of $\gamma_{i,j}$ yields

$$
C_j \log(1 + \tau_{i,j} Y_{i,N_t}) = C_i \log(1 + \tau_{i,j} Y_{j,N_t})
$$

where $C_i = -\log(1 - p_{i,n})$, which can be solved numerically for $\tau_{i,j}$. Plugging
$\hat\tau_{i,j}$ into (5.18) (or (5.19)) and solving for $\gamma_{i,j}$, we obtain

$$
\hat\gamma_{i,j} = \frac{\log(1 + \hat\tau_{i,j} Y_{i,N_t})}{C_i}
$$

and then

$$
\hat\sigma_{i,j} = \frac{\hat\gamma_{i,j}}{\hat\tau_{i,j}}.
$$

In order to use all available information, $\hat\gamma_{i,j}$ and $\hat\sigma_{i,j}$ are computed for all
distinct pairs of order statistics $Y_{i,N_t} < Y_{j,N_t}$, leading to the final EPM estimators

$$
\hat\gamma_{EPM} = \mathrm{median}\{\hat\gamma_{i,j};\ i < j\}, \qquad
\hat\sigma_{EPM} = \mathrm{median}\{\hat\sigma_{i,j};\ i < j\}.
$$

In case $i = \lceil N_t/2 \rceil$ and $j = \lceil 3N_t/4 \rceil$, it is easy to show that the system of equations (5.18)
and (5.19) has a closed-form solution given by

$$
\hat\gamma_{\lceil N_t/2 \rceil, \lceil 3N_t/4 \rceil} = \frac{1}{\log 2} \log \frac{Y_{\lceil 3N_t/4 \rceil, N_t} - Y_{\lceil N_t/2 \rceil, N_t}}{Y_{\lceil N_t/2 \rceil, N_t}}, \tag{5.20}
$$

$$
\hat\sigma_{\lceil N_t/2 \rceil, \lceil 3N_t/4 \rceil} = \frac{\hat\gamma_{\lceil N_t/2 \rceil, \lceil 3N_t/4 \rceil}\; Y^2_{\lceil N_t/2 \rceil, N_t}}{Y_{\lceil 3N_t/4 \rceil, N_t} - 2\,Y_{\lceil N_t/2 \rceil, N_t}}. \tag{5.21}
$$

In fact, (5.20) is the Pickands (1975) estimator for $\gamma$ as discussed above.

In the above discussion, we always assumed that a sample $Y_1, \ldots, Y_{N_t}$ of i.i.d.
GP random variables is available. If the data are not exactly GP distributed, one
can rely on relation (5.12) and use the GP distribution as an approximation to
the conditional distribution of the exceedances. In this case, the GP distribution is
fitted to the excesses $Y_j = X_i - t$, in case $X_i > t$, $j = 1, \ldots, N_t$, using one of the
above described methods. Note that in the latter case, $N_t$ is random.

Example 5.3 (continued) Applying the POT approach to the SOA Group Medical
Insurance data introduced in section 1.3.3 with a threshold of 400,000 USD, we fit
the GP distribution to the excesses $y_j = x_i - 400{,}000$. The ML procedure leads to
$\hat\gamma = 0.3823$ when $t = 400{,}000$. The quality of this GP fit to the empirical distribution
function of the data $Y_i$ is depicted in Figure 5.7(a). Figure 5.7(b) contains the
W-plot of the GP fit to the exceedances over $t = 400{,}000$. In Table 5.2, we show
the ML, MOM, PWM and EPM estimates for the parameters $\sigma$ and $\gamma$ obtained
from fitting the GP distribution to the excesses over $t = 400{,}000$.


The choice of the threshold t is very much an open matter and resembles 
the choice of the value of k in the previous chapter. As in the case of the Hill 
estimator, a compromise has to be found between high values of t, where the bias 
of the estimator will be smallest, and low values of t, where the variance will be 
smallest. In the literature on the POT method, not much attention has been given to 
this aspect. Davison and Smith (1990) propose to use the mean excess plot. Indeed,
the mean excess function of the GP distribution is given by the linear expression

$$
e(t) = \frac{\sigma + \gamma t}{1 - \gamma}, \qquad \gamma < 1.
$$

Hence, one is led to the graphical approach of choosing $t = X_{n-k,n}$ as the point
to the right of which a linear pattern appears in the plot $\{(X_{n-k,n}, E_{k,n});\ k =
1, \ldots, n - 1\}$.

The POT method, however, will often lead to stable plots of the estimates
as a function of $k$, less volatile than for the case of the Hill plots. An illustration
is found in Figure 5.8 concerning the SOA Group Medical Insurance data.

We need to emphasize that, whereas the POT method yields more stable plots
for the estimates as a function of the threshold $t$, the bias can still be quite
substantial.

Table 5.2 ML, MOM, PWM and EPM estimates for the SOA Group Medical
Insurance data with $t = 400{,}000$.

Method    $\sigma$     $\gamma$
ML        142,489      0.3823
MOM       156,841      0.3095
PWM       142,933      0.3707
EPM       139,838      0.4112

Figure 5.8 SOA Group Medical Insurance data: ML estimate (solid line), MOM
estimate (broken line), probability-weighted moments estimate (broken-dotted line)
and elemental percentile estimate (dotted line) as a function of $k$.
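The mean excess plot used for threshold selection can be computed with a single pass over the ordered sample. In the Python sketch below, the function name is hypothetical and $E_{k,n}$ is taken to be the usual empirical mean excess, $E_{k,n} = \frac{1}{k}\sum_{j=1}^{k} X_{n-j+1,n} - X_{n-k,n}$, which is an assumption about notation defined elsewhere in the book.

```python
def mean_excess_points(sample):
    """Points (X_{n-k,n}, E_{k,n}), k = 1..n-1, of the empirical mean excess plot."""
    x = sorted(sample)
    n = len(x)
    points, top_sum = [], 0.0
    for k in range(1, n):
        top_sum += x[n - k]                 # adds X_{n-k+1,n}
        threshold = x[n - k - 1]            # X_{n-k,n}
        points.append((threshold, top_sum / k - threshold))
    return points

# Tiny made-up sample, for illustration only.
for point in mean_excess_points([3, 1, 10, 4, 2]):
    print(point)
```

One would then look for the threshold to the right of which the plotted points follow a roughly straight line, whose slope $\gamma/(1-\gamma)$ and intercept $\sigma/(1-\gamma)$ reflect the GP parameters.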
5.4 Estimators Based on an Exponential Regression Model


In this section, we will discuss the approximate representation for log-ratios of
spacings of successive order statistics as derived by Matthys and Beirlant (2003).
This representation extends the exponential regression models derived in Chapter 4
in a natural way to the general case where $\gamma \in \mathbb{R}$.

Let $U_{j,n}$, $j = 1, \ldots, n$, denote the order statistics of a random sample of size
$n$ from the $U(0, 1)$ distribution. Then, for $k = 1, \ldots, n - 1$, $(V_{j,k} := U_{j,n}/U_{k+1,n};\
j = 1, \ldots, k)$ are jointly distributed as the order statistics of a random sample
of size $k$ from the $U(0, 1)$ distribution. As before, $E_j$, $j = 1, \ldots, n$, denote
standard exponential random variables and $E_{j,n}$, $j = 1, \ldots, n$, are the corresponding
ascending order statistics. The inverse probability integral transform together with
($C_\gamma$) imply that, for $j = 1, \ldots, k$,

$$
\begin{aligned}
X_{n-j+1,n} - X_{n-k,n} &\stackrel{D}{=} U(U_{j,n}^{-1}) - U(U_{k+1,n}^{-1}) \\
&= U(V_{j,k}^{-1}\,U_{k+1,n}^{-1}) - U(U_{k+1,n}^{-1}) \\
&\sim a(U_{k+1,n}^{-1})\,\frac{V_{j,k}^{-\gamma} - 1}{\gamma}
\end{aligned}
$$

provided $k/n \to 0$. For a log-ratio of spacings of order statistics, we then obtain

$$
\log\frac{X_{n-j+1,n} - X_{n-k,n}}{X_{n-j,n} - X_{n-k,n}} \sim \log\frac{V_{j,k}^{-\gamma} - 1}{V_{j+1,k}^{-\gamma} - 1}, \qquad j = 1, \ldots, k - 1.
$$

Application of the mean value theorem with $E^*_{k-j,k} \in (E_{k-j,k}, E_{k-j+1,k})$ and
$V^*_{j,k} = \exp(-E^*_{k-j,k})$, and the Rényi representation, gives

$$
\begin{aligned}
\log\frac{V_{j,k}^{-\gamma} - 1}{V_{j+1,k}^{-\gamma} - 1}
&\stackrel{D}{=} \log\left( \exp(\gamma E_{k-j+1,k}) - 1 \right) - \log\left( \exp(\gamma E_{k-j,k}) - 1 \right) \\
&= (E_{k-j+1,k} - E_{k-j,k})\,\frac{\gamma \exp(\gamma E^*_{k-j,k})}{\exp(\gamma E^*_{k-j,k}) - 1} \\
&\stackrel{D}{=} \frac{E_j}{j}\,\frac{\gamma}{1 - (V^*_{j,k})^{\gamma}}.
\end{aligned}
$$

We now replace $V^*_{j,k}$ by $j/(k+1)$ to obtain the following approximate
representation for log-ratios of spacings

$$
j \log\frac{X_{n-j+1,n} - X_{n-k,n}}{X_{n-j,n} - X_{n-k,n}} \sim \frac{\gamma}{1 - \left(\frac{j}{k+1}\right)^{\gamma}}\,E_j, \qquad j = 1, \ldots, k - 1, \tag{5.22}
$$

from which $\gamma$ can be estimated with the ML method. By construction, the resulting
ML estimator of $\gamma$ is invariant with respect to a shift and a rescaling
of the data. Later on in this chapter, we will refine this estimator by imposing a
second-order tail condition on $U$.

Figure 5.9 SOA Group Medical Insurance data: the ML estimate of $\gamma$ based on (5.22)
(solid line) and a second curve (broken line) as a function of $k$.

Example 5.3 (continued) In Figure 5.9, we illustrate the use of the ML estimator based on (5.22) (solid line)
on the SOA Group Medical Insurance claim data set. The exponential regression
model approach also indicates a $\gamma$ estimate of around 0.35, a result that is consistent
with the earlier analysis.

5.5 Extreme Tail Probability, Large Quantile and 
Endpoint Estimation Using Threshold Methods 

5.5.1 The quantile view 

On the basis of (C y ), we take xu = ^ and x = so that U(xu) = Q(l — p) and 
U (x) = X n -k n lead to the following general form of extreme quantile estimator: 




90 寸 ds 

CCLULUCOO) 
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where a ( 鞋 f) and y denote estimators of a (jtj) and y respectively. Any one of 
the estimators of y discussed above can be used. Concerning a, it appears natural 
from (2.15) to consider 


4+1、 
、 A: + 1, 


(1 - y~)UH Kn = (1 - i>—)X n 一 k ， n H k , n . 


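Alternatively, following Matthys and Beirlant (2003), $a\left(\frac{n+1}{k+1}\right)$ can also be estimated on the basis of an approximate exponential regression model for spacings of successive order statistics. On the basis of $(C_\gamma)$, for $k/n \to 0$,

$$X_{n-j+1,n}-X_{n-j,n} \sim a(U_{k+1,n}^{-1})\,\frac{V_{j,k}^{-\gamma}-V_{j+1,k}^{-\gamma}}{\gamma}, \qquad j=1,\ldots,k.$$

Application of the mean value theorem, with the same notation for $E^*_{j,k}$ and $V^*_{j,k}$ as in section 5.4 and using the Rényi representation results in

$$\begin{aligned}
\frac{V_{j,k}^{-\gamma}-V_{j+1,k}^{-\gamma}}{\gamma}
&\stackrel{D}{=} \frac{\exp(\gamma E_{k-j+1,k})-\exp(\gamma E_{k-j,k})}{\gamma}\\
&= (E_{k-j+1,k}-E_{k-j,k})\exp(\gamma E^*_{j,k})\\
&\stackrel{D}{=} \frac{E_j}{j}\,(V^*_{j,k})^{-\gamma}.
\end{aligned}$$

Hence, after replacing $V^*_{j,k}$ by $j/(k+1)$ the following approximate exponential regression model for spacings of successive order statistics is obtained

$$j\,(X_{n-j+1,n}-X_{n-j,n}) \sim a_{n,k+1}\left(\frac{j}{k+1}\right)^{-\gamma}E_j, \qquad j=1,\ldots,k, \tag{5.24}$$

with $a_{n,k+1} = a\left(\frac{n+1}{k+1}\right)$. Using straightforward derivations, the log-likelihood function of model (5.24) is maximal at

$$\hat a_{n,k+1} = \frac{1}{k}\sum_{j=1}^{k} j\,(X_{n-j+1,n}-X_{n-j,n})\left(\frac{j}{k+1}\right)^{\gamma}. \tag{5.25}$$

Extreme quantiles can now be estimated using

$$\hat U^{RMA}_{k,n}\left(\frac{1}{p}\right) = X_{n-k,n} + \hat a_{n,k+1}\,\frac{\left(\frac{k+1}{(n+1)p}\right)^{\hat\gamma^{RMA}_{k}}-1}{\hat\gamma^{RMA}_{k}},$$

where $\hat a_{n,k+1}$ is as in (5.25) but with $\gamma$ replaced by $\hat\gamma^{RMA}_{k}$.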

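Concerning the estimation of extreme tail probabilities $P(X > x)$, condition $(C^*_\gamma)$ similarly leads, on setting $U(x) + v\,a(x) =: y$, to

$$\hat p(y) = \frac{k+1}{n+1}\left(1+\hat\gamma\,\frac{y-X_{n-k,n}}{\hat a\left(\frac{n+1}{k+1}\right)}\right)^{-1/\hat\gamma}. \tag{5.26}$$

Finally, when $\gamma < 0$, an estimation of $x_+$ is obtained by letting $p \to 0$ in (5.23):

$$\hat x_+ = X_{n-k,n} - \frac{\hat a\left(\frac{n+1}{k+1}\right)}{\hat\gamma} = X_{n-k,n} - \frac{1-\hat\gamma^{-}}{\hat\gamma}\,X_{n-k,n}H_{k,n}. \tag{5.27}$$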
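The estimators of this subsection are simple plug-in formulas, and a short Python sketch may help to see how (5.23), (5.25), (5.26) and (5.27) fit together. It is an illustration only: the function names are hypothetical, the threshold $X_{n-k,n}$, the scale estimate and the estimate of $\gamma$ are assumed to be supplied by the user, and the case $\hat\gamma = 0$ is handled through the usual logarithmic and exponential limits.

```python
import math


def scale_from_spacings(desc, k, gamma):
    """Equation (5.25); desc is the sample sorted in decreasing order."""
    total = 0.0
    for j in range(1, k + 1):
        spacing = desc[j - 1] - desc[j]              # X_{n-j+1,n} - X_{n-j,n}
        total += j * spacing * (j / (k + 1.0)) ** gamma
    return total / k


def h(gamma, u):
    """h_gamma(u) = (u**gamma - 1) / gamma, with limit log(u) at gamma = 0."""
    if abs(gamma) < 1e-12:
        return math.log(u)
    return (u ** gamma - 1.0) / gamma


def quantile_estimate(x_nk, a_hat, gamma_hat, n, k, p):
    """Equation (5.23): estimate of U(1/p) = Q(1 - p)."""
    return x_nk + a_hat * h(gamma_hat, (k + 1.0) / ((n + 1.0) * p))


def tail_probability(y, x_nk, a_hat, gamma_hat, n, k):
    """Equation (5.26): estimate of P(X > y) for y above the threshold x_nk."""
    scaled = (y - x_nk) / a_hat
    if abs(gamma_hat) < 1e-12:
        survival = math.exp(-scaled)
    else:
        base = 1.0 + gamma_hat * scaled
        # base <= 0 can only happen beyond the estimated endpoint (gamma_hat < 0)
        survival = base ** (-1.0 / gamma_hat) if base > 0.0 else 0.0
    return (k + 1.0) / (n + 1.0) * survival


def endpoint_estimate(x_nk, a_hat, gamma_hat):
    """Equation (5.27): finite right endpoint, meaningful only for gamma_hat < 0."""
    if gamma_hat >= 0.0:
        raise ValueError("the endpoint estimate requires a negative gamma estimate")
    return x_nk - a_hat / gamma_hat
```

By construction `tail_probability` inverts `quantile_estimate`, so evaluating the former at an estimated quantile returns the probability level one started from.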


5.5.2 The probability view 

Extreme quantiles of the GP distribution can be estimated by inverting the GP distribution function given by (5.13), yielding

$$U\left(\frac{1}{p}\right) = \begin{cases}\dfrac{\sigma}{\gamma}\left(p^{-\gamma}-1\right), & \gamma \neq 0,\\[1ex] -\sigma\log p, & \gamma = 0,\end{cases} \tag{5.28}$$

and replacing the unknown parameters by one of the above described estimates. In case $\gamma < 0$, the right endpoint of the GP distribution is finite and can be obtained by letting $p \to 0$ in (5.28):

$$x_+ = -\frac{\sigma}{\gamma}.$$
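If the data are not exactly GP distributed, relation (5.12) implies that

$$\bar F_t(y) = \frac{\bar F(t+y)}{\bar F(t)} \sim \left(1+\frac{\gamma y}{\sigma}\right)^{-1/\gamma}$$

so that with $x = t + y$

$$\bar F(x) \sim \bar F(t)\left(1+\frac{\gamma(x-t)}{\sigma}\right)^{-1/\gamma}.$$

Estimating $\bar F(t)$ by $N_t/n$ and replacing $\gamma$ and $\sigma$ by their respective ML, MOM, PWM or EPM estimates, we obtain that

$$\hat{\bar F}(x) = \frac{N_t}{n}\left(1+\hat\gamma\,\frac{x-t}{\hat\sigma}\right)^{-1/\hat\gamma}. \tag{5.29}$$

The POT estimator for large quantiles $Q(1-p)$ can now be obtained from inverting the right-hand side in (5.29):

$$\hat U\left(\frac{1}{p}\right) = t + \frac{\hat\sigma}{\hat\gamma}\left[\left(\frac{np}{N_t}\right)^{-\hat\gamma}-1\right]. \tag{5.30}$$

Note that in case $\gamma < 0$, an estimator for the right endpoint of the support of the distribution $x_+ = Q(1)$ is obtained by letting $p \to 0$ in (5.30):

$$\hat x_+ = t - \frac{\hat\sigma}{\hat\gamma}. \tag{5.31}$$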
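A minimal Python sketch of the POT estimators (5.29), (5.30) and (5.31) follows. The threshold `t`, the number of exceedances `n_exceed` ($N_t$), the sample size `n` and the GP parameter estimates are assumed to be given; the function names are hypothetical and the numerical values in the usage note are invented for illustration.

```python
import math


def pot_tail_probability(x, t, n_exceed, n, gamma, sigma):
    """Equation (5.29): estimate of P(X > x) for x above the threshold t."""
    scaled = (x - t) / sigma
    if abs(gamma) < 1e-12:
        survival = math.exp(-scaled)                 # exponential limit for gamma = 0
    else:
        base = 1.0 + gamma * scaled
        survival = base ** (-1.0 / gamma) if base > 0.0 else 0.0
    return n_exceed / n * survival


def pot_quantile(p, t, n_exceed, n, gamma, sigma):
    """Equation (5.30): estimate of Q(1 - p), obtained by inverting (5.29)."""
    ratio = n * p / n_exceed
    if abs(gamma) < 1e-12:
        return t - sigma * math.log(ratio)
    return t + sigma / gamma * (ratio ** (-gamma) - 1.0)


def pot_endpoint(t, gamma, sigma):
    """Equation (5.31): right endpoint estimate, only defined for gamma < 0."""
    if gamma >= 0.0:
        raise ValueError("a finite endpoint requires gamma < 0")
    return t - sigma / gamma
```

For instance, with a hypothetical threshold `t = 100.0`, `n_exceed = 200` exceedances out of `n = 5000` observations and estimates `gamma = 0.35`, `sigma = 50.0`, the call `pot_quantile(1e-5, 100.0, 200, 5000, 0.35, 50.0)` gives the estimated quantile at level $1 - 10^{-5}$, and `pot_tail_probability` evaluated at that value returns $10^{-5}$ again.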



TAIL ESTIMATION FOR ALL DOMAINS OF ATTRACTION 




Figure 5.10 SOA Group Medical Insurance data: $\hat U(100{,}000)$ as a function of $k$.

Note further that in case the parameters of the GP distribution are estimated with the ML method, extreme quantile estimates can be obtained directly by reparametrizing the log-likelihood function in terms of $U(1/p)$, for example, setting

$$\sigma = \frac{\gamma\,(U(1/p)-t)}{\left(\frac{np}{N_t}\right)^{-\gamma}-1}.$$

Example 5.3 (continued) In Figure 5.10, we illustrate the estimation of extreme quantiles using the POT approach. Figure 5.10 shows the estimates for $U(100{,}000)$ as a function of $k$. Here, $\hat U$ was obtained by plugging the ML estimates for $\gamma$ and $\sigma$ in (5.30). A stable region appears for $k$ from 200 up to 500, leading to an estimate of 4 million.

5.5.3 Inference: confidence intervals 

Approximate $100(1-\alpha)\%$ confidence intervals for the parameters $\gamma$ and $\sigma$ of the GP distribution can be constructed on the basis of the asymptotic normality of the ML, MOM and PWM estimators. For instance, a $100(1-\alpha)\%$ confidence interval for $\gamma$ is given by

$$\hat\gamma \pm \Phi^{-1}(1-\alpha/2)\sqrt{\frac{\hat v_{1,1}}{N_t}}$$

where $\hat\gamma$ is either the ML, MOM or PWM estimate for $\gamma$ and $\hat v_{1,1}$ is the first diagonal element of respectively $V_1$, $V_2$ or $V_3$ (for more information on these covariance matrices, we refer to section 5.6) with the unknown parameters replaced by their estimates. Inference about return levels $U(1/p)$ can be drawn in a similar way. Straightforward application of the delta method gives

$$\sqrt{N_t}\left(\hat U\left(\frac{1}{p}\right)-U\left(\frac{1}{p}\right)\right) \stackrel{D}{\to} N(0,\kappa' V\kappa)$$

where $V$ is either $V_1$, $V_2$ or $V_3$ and

$$\kappa' = \left[\frac{\partial U(1/p)}{\partial\gamma},\ \frac{\partial U(1/p)}{\partial\sigma}\right] = \left[-\frac{\sigma}{\gamma^2}\left(p^{-\gamma}-1\right)-\frac{\sigma}{\gamma}\,p^{-\gamma}\log p,\ \frac{1}{\gamma}\left(p^{-\gamma}-1\right)\right],$$

so a $100(1-\alpha)\%$ confidence interval for $U(1/p)$ is given by

$$\hat U\left(\frac{1}{p}\right) \pm \Phi^{-1}(1-\alpha/2)\sqrt{\frac{\hat\kappa'\hat V\hat\kappa}{N_t}}.$$
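The delta-method interval for a return level can be sketched in Python as follows. The sketch assumes ML estimation, so the covariance matrix used is $V_1$ from section 5.6 (valid for $\gamma > -1/2$), and it assumes $\hat\gamma \neq 0$; the function names are hypothetical. The gradient is the vector of partial derivatives of the GP quantile (5.28) with respect to $\gamma$ and $\sigma$.

```python
import math
from statistics import NormalDist


def gp_quantile(p, gamma, sigma):
    """GP quantile U(1/p) from (5.28), for gamma != 0."""
    return sigma / gamma * (p ** (-gamma) - 1.0)


def gp_quantile_gradient(p, gamma, sigma):
    """Partial derivatives of U(1/p) with respect to gamma and sigma."""
    power = p ** (-gamma)
    d_gamma = (-sigma / gamma ** 2 * (power - 1.0)
               - sigma / gamma * power * math.log(p))
    d_sigma = (power - 1.0) / gamma
    return d_gamma, d_sigma


def ml_covariance(gamma, sigma):
    """Asymptotic covariance V_1 of the ML estimators (gamma, sigma), gamma > -1/2."""
    factor = 1.0 + gamma
    return [[factor * factor, -factor * sigma],
            [-factor * sigma, 2.0 * factor * sigma ** 2]]


def quantile_interval(p, gamma_hat, sigma_hat, n_exceed, alpha=0.05):
    """Approximate 100(1 - alpha)% delta-method interval for U(1/p)."""
    g1, g2 = gp_quantile_gradient(p, gamma_hat, sigma_hat)
    v = ml_covariance(gamma_hat, sigma_hat)
    variance = (g1 * g1 * v[0][0] + 2.0 * g1 * g2 * v[0][1]
                + g2 * g2 * v[1][1]) / n_exceed
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    centre = gp_quantile(p, gamma_hat, sigma_hat)
    half_width = z * math.sqrt(variance)
    return centre - half_width, centre + half_width
```

Such intervals are symmetric around the point estimate by construction, which is one reason why the profile likelihood intervals discussed next are often preferred for return levels.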

Often, better confidence intervals can be constructed on the basis of the profile likelihood ratio test statistic. The profile likelihood function for $\gamma$ is given by

$$L_p(\gamma) = \max_{\sigma} L(\sigma,\gamma).$$

Using similar arguments as in case of the GEV, the $100(1-\alpha)\%$ profile likelihood confidence interval for the parameter $\gamma$ can be obtained as

$$CI_\gamma = \left\{\gamma : \log L_p(\gamma) \geq \log L_p(\hat\gamma) - \frac{\chi^2_1(1-\alpha)}{2}\right\}.$$

The special case of testing $H_0 : \gamma = 0$ is described in Marohn (1999).

Example 5.3 (continued) Figure 5.11 illustrates the profile likelihood function-based confidence intervals using the SOA Group Medical Insurance data. In Figure 5.11(a) and (b), we show the profile log-likelihood function of $\gamma$ and $U(100{,}000)$ respectively at $k = 200$, together with the 95% confidence interval.

5.6 Asymptotic Results Under $(C_\gamma)$-$(C^*_\gamma)$

In order to be able to construct asymptotic confidence intervals or tests concerning $\gamma$, we now discuss the most relevant asymptotic results and present some asymptotic comparisons between some of the estimators.

Figure 5.11 SOA Group Medical Insurance data: profile log-likelihood function and profile likelihood-based 95% confidence intervals at $k = 200$ for (a) $\gamma$ and (b) $U(100{,}000)$.

In case of the ML and the probability-weighted moment approach for peaks over thresholds, one can develop asymptotic results under the assumption that the excesses exactly follow a GP distribution, rather than assuming $(C_\gamma)$ or, equivalently, $(C^*_\gamma)$ (see also section 5.1 concerning the method of block maxima). This is a restrictive approach that should of course be validated through some of the goodness-of-fit methods that were developed above. Under this parametric approach, the following asymptotic results can be stated concerning the POT methods:


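(i) The ML estimators $(\hat\gamma_{ML},\hat\sigma_{ML})$ are asymptotically normal: provided $\gamma > -1/2$, for $N_t \to \infty$

$$\sqrt{N_t}\left((\hat\gamma_{ML},\hat\sigma_{ML})-(\gamma,\sigma)\right) \stackrel{D}{\to} N(0,V_1),$$

where

$$V_1 = (1+\gamma)\begin{bmatrix} 1+\gamma & -\sigma\\ -\sigma & 2\sigma^2\end{bmatrix},$$

while the ML estimator is superefficient when $-1 < \gamma < -1/2$; that is, the ML estimator converges with rate of consistency $N_t^{-\gamma}$.

(ii) The MOM estimators $(\hat\gamma_{MOM},\hat\sigma_{MOM})$ satisfy for $N_t \to \infty$

$$\sqrt{N_t}\left((\hat\gamma_{MOM},\hat\sigma_{MOM})-(\gamma,\sigma)\right) \stackrel{D}{\to} N(0,V_2),$$

where

$$V_2 = C\begin{bmatrix}(1-2\gamma)^2(1-\gamma+6\gamma^2) & -\sigma(1-2\gamma)(1-4\gamma+12\gamma^2)\\ -\sigma(1-2\gamma)(1-4\gamma+12\gamma^2) & 2\sigma^2(1-6\gamma+12\gamma^2)\end{bmatrix}, \qquad C = \frac{(1-\gamma)^2}{(1-2\gamma)(1-3\gamma)(1-4\gamma)},$$

provided $\gamma < 1/4$.

(iii) The probability-weighted moment estimators $(\hat\gamma_{PWM},\hat\sigma_{PWM})$ satisfy, provided $\gamma < 1/2$, as $N_t \to \infty$

$$\sqrt{N_t}\left((\hat\gamma_{PWM},\hat\sigma_{PWM})-(\gamma,\sigma)\right) \stackrel{D}{\to} N(0,V_3),$$

where

$$V_3 = C\begin{bmatrix}(1-\gamma)(2-\gamma)^2(1-\gamma+2\gamma^2) & -\sigma(2-\gamma)(2-6\gamma+7\gamma^2-2\gamma^3)\\ -\sigma(2-\gamma)(2-6\gamma+7\gamma^2-2\gamma^3) & \sigma^2(7-18\gamma+11\gamma^2-2\gamma^3)\end{bmatrix}, \qquad C = \frac{1}{(1-2\gamma)(3-2\gamma)}.$$

(iv) Likewise, the asymptotic properties of the initial estimators $\hat\gamma_{i,j}$ and $\hat\sigma_{i,j}$ in the EPM can be derived. First, consider the limiting behaviour of order statistics $(Y_{i,N_t}, Y_{j,N_t})$. Let $i = \lfloor N_t p\rfloor$ and $j = \lfloor N_t q\rfloor$, $0 < p < q < 1$, and let $Q$ denote the GP quantile function

$$Q(p) = \frac{\sigma}{\gamma}\left((1-p)^{-\gamma}-1\right).$$

It can be shown that for $N_t \to \infty$

$$\sqrt{N_t}\left((Y_{i,N_t},Y_{j,N_t})-(Q(p),Q(q))\right) \stackrel{D}{\to} N(0,W)$$

where

$$W = \begin{bmatrix}\sigma^2 p(1-p)^{-2\gamma-1} & \sigma^2 p(1-p)^{-\gamma-1}(1-q)^{-\gamma}\\ \sigma^2 p(1-p)^{-\gamma-1}(1-q)^{-\gamma} & \sigma^2 q(1-q)^{-2\gamma-1}\end{bmatrix}.$$

Remark that these limit results for the marginal distributions were derived in section 3.2, case 2 (iv). Straightforward application of the delta method then yields the limiting behaviour of the initial estimators:

$$\sqrt{N_t}\left((\hat\gamma_{i,j},\hat\sigma_{i,j})-(\gamma,\sigma)\right) \stackrel{D}{\to} N(0,CWC')$$

with

$$C = \frac{1}{\eta}\begin{bmatrix} 1-(1-q)^{-\gamma} & (1-p)^{-\gamma}-1\\ -\left[Q(q)+\sigma\log(1-q)(1-q)^{-\gamma}\right] & Q(p)+\sigma\log(1-p)(1-p)^{-\gamma}\end{bmatrix}$$

and

$$\eta = \log(1-p)\,Q(q)(1-p)^{-\gamma} - \log(1-q)\,Q(p)(1-q)^{-\gamma}.$$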
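To compare the three parametric POT estimators numerically, the covariance matrices of (i) to (iii) can be coded directly. The Python sketch below is an illustration under the assumption that the matrices are as stated in the list above; the function names are hypothetical, and each function raises an error outside the range of $\gamma$ for which the corresponding limit result is stated.

```python
def ml_covariance(gamma, sigma):
    """V_1 for the ML estimators, stated for gamma > -1/2."""
    if gamma <= -0.5:
        raise ValueError("V_1 requires gamma > -1/2")
    f = 1.0 + gamma
    return [[f * f, -f * sigma], [-f * sigma, 2.0 * f * sigma ** 2]]


def mom_covariance(gamma, sigma):
    """V_2 for the MOM estimators, stated for gamma < 1/4."""
    if gamma >= 0.25:
        raise ValueError("V_2 requires gamma < 1/4")
    g = gamma
    c = (1.0 - g) ** 2 / ((1.0 - 2.0 * g) * (1.0 - 3.0 * g) * (1.0 - 4.0 * g))
    off = -sigma * (1.0 - 2.0 * g) * (1.0 - 4.0 * g + 12.0 * g * g)
    return [[c * (1.0 - 2.0 * g) ** 2 * (1.0 - g + 6.0 * g * g), c * off],
            [c * off, c * 2.0 * sigma ** 2 * (1.0 - 6.0 * g + 12.0 * g * g)]]


def pwm_covariance(gamma, sigma):
    """V_3 for the PWM estimators, stated for gamma < 1/2."""
    if gamma >= 0.5:
        raise ValueError("V_3 requires gamma < 1/2")
    g = gamma
    c = 1.0 / ((1.0 - 2.0 * g) * (3.0 - 2.0 * g))
    off = -sigma * (2.0 - g) * (2.0 - 6.0 * g + 7.0 * g * g - 2.0 * g ** 3)
    return [[c * (1.0 - g) * (2.0 - g) ** 2 * (1.0 - g + 2.0 * g * g), c * off],
            [c * off, c * sigma ** 2 * (7.0 - 18.0 * g + 11.0 * g * g - 2.0 * g ** 3)]]
```

The first diagonal element of each matrix, divided by $N_t$, is the approximate variance of the corresponding estimator of $\gamma$, so the ratio of these elements gives an asymptotic efficiency comparison at a given parameter value.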

Another more general point of view recognizes that, in general, the excesses are not exactly GP distributed, but that the POT distribution approaches a GP distribution for high-enough thresholds as assumed under $(C_\gamma)$-$(C^*_\gamma)$. For this, one typically adds an assumption concerning the rate of convergence of the excess distribution to the GP family. This can be found in the second-order theory discussed in Chapter 3. The semi-parametric point of view then results in the appearance of an asymptotic bias in the results.

Following the theory of generalized regular variation of second order outlined in de Haan and Stadtmüller (1996), we assume the existence of a positive function $a$ and a second ultimately positive auxiliary function $a_2$ with $a_2(x) \to 0$ when $x \to \infty$, such that

$$\lim_{x\to\infty}\frac{1}{a_2(x)}\left\{\frac{U(ux)-U(x)}{a(x)}-h_\gamma(u)\right\} = c\int_1^u t^{\gamma-1}h_\rho(t)\,dt + A\,h_{\gamma+\rho}(u). \tag{5.32}$$

In the sequel, we denote the class of generalized second order regularly varying functions $U$ satisfying (5.32) with $GRV_2(\gamma,\rho;\,a(x),a_2(x);\,c,A)$.

We restrict the discussion here to the case $\rho < 0$, in which case a clever choice of the auxiliary function $a_2$ results in a simplification of the limit function in (5.32) with $c = 0$.

In Appendix 5.9.3, we give an overview of possible kinds of $GRV_2$ functions and the corresponding representations for $U$ and $\log U$ as given in Vanroelen (2003). From this list, it follows that the second-order rate in (5.32) is worse for $\log U$ compared to $U$ when $\rho < \gamma < 0$ and in some cases when $0 < \gamma < -\rho$. In these cases, this will entail asymptotic relative efficiency 0 for estimators based on log-transformed data compared to shift invariant estimators such as the ML estimator or the Pickands type estimators, that is, if all these estimators are based on the pertaining optimal number of order statistics.

When $0 < \gamma < -\rho$, this rate problem for $\log U$ arises with the appearance of the constant $D$ in the characterization of $U$ in that case, namely, $U(x) = \ell_+x^{\gamma}\left\{\frac{1}{\gamma} + Dx^{-\gamma} + \frac{A}{\gamma+\rho}\,a_2(x)(1+o(1))\right\}$. When $D = 0$, the original $a_2$-rate is kept for $\log U$, while it is not when $D \neq 0$, in which case $a_2$ is replaced by a regularly varying function with index $-\gamma$. Within the Hall class of Pareto-type distributions (see section 3.3.2),
the case D ^ 0 occurs when = y. This is the case, for instance, for the Fisher 
F and the GEV distributions. Also remark the special representation in case $\gamma + \rho = 0$, where a slowly varying function $L_2$ appears, discussed in Appendix 5.9.3.

The representations for $U$, respectively $\log U$, as given in Appendix 5.9.3 can be used to derive the asymptotic mean squared errors (AMSEs) of some well-known estimators, see Appendix 5.9.4. In this, we assume that the slowly varying parts of $b$ and $a_2$ are asymptotically equivalent to a constant. The optimal values of $k$, which minimize the different expressions of the AMSEs together with the corresponding minimal AMSE values, are found in Appendix 5.9.5. Matthys and Beirlant (2003) and Segers (2004) contain similar asymptotic results for $\hat\gamma^{RMA}_{k}$, $\hat\gamma^{RMB}_{k}$ (see section 5.7.1), respectively $\hat\gamma_k(c,\lambda)$.

We end this section by specifying the asymptotic distribution of $\hat U(1/p)$ as defined in (5.23) with the moment estimator $M_{k,n}$ substituted for $\hat\gamma$. Such asymptotic results were first proven in de Haan and Rootzén (1993) and were further
explored in Ferreira et al. (2003). Matthys and Beirlant (2003) provide analogous 
results for p). 

Considering the conditions of Proposition 3.2, Ferreira et al. (2003) defined the function

$$\tilde a_2(x) = \begin{cases} a_2(x), & \text{if } \gamma < \rho,\\[0.5ex] \gamma^{+} - a(x)/U(x), & \text{if } \rho < \gamma < 0 \text{ or } 0 < \gamma < -\rho \text{ with } D \neq 0 \text{ or } \gamma = -\rho,\\[0.5ex] \dfrac{\rho\,a_2(x)}{\gamma+\rho}, & \text{if } \gamma > -\rho \text{ or } 0 < \gamma < -\rho \text{ with } D = 0,\end{cases}$$

where $D$ is as in the representations for $U$ given in Vanroelen (2003), see also Appendix 5.9.3.

Let $a_n = k/(np_n)$. Then when $U(\infty) > 0$, and $k = k_n \to \infty$ such that $n/k_n \to \infty$, $np_n \to c > 0$ (finite), $\sqrt{k}\,\tilde a_2(n/k) \to 0$ and $(\log a_n)/\sqrt{k} \to 0$ as $n \to \infty$, we have that as $\gamma \neq 0$ and $\gamma \neq \rho$


• in case y > 0 


(ir f 1 \ 

a (f)«« \oga n V \Pn) 

• while in case y < 0 



~u^))^N(0Ai + yf), 


N ( Q (l-y) 2 (l-3y+4y 2 ) \ 

V ' y 4 (l-2/)(1-3y)(l-4y )； 


In case $\sqrt{k}\,\tilde a_2(n/k) \to \lambda \in \mathbb{R}$ an asymptotic bias appears, see, for instance, Ferreira et al. (2003).


5.7 Reducing the Bias 

In this section, we show how some of the estimators based on the first-order condition $(C_\gamma)$ can be refined by taking into account the second-order tail behaviour. This then is analogous to section 4.4. We confine ourselves here to the estimator based on the exponential regression model introduced in section 5.4.


5.7.1 The quantile view 


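From the discussion in Chapter 2, we have that $F \in \mathcal{D}(G_\gamma)$ implies the existence of a slowly varying function $\ell$ and a function $d$ with $\pm d \in \mathcal{R}_0$ and $d(x) \to \gamma$ as $x \to \infty$ such that

$$\frac{U(ux)-U(x)}{a(x)} = \frac{1}{d(x)}\left(u^{\gamma}\,\frac{\ell(ux)}{\ell(x)}-1\right). \tag{5.33}$$

Matthys and Beirlant (2003) refined the exponential regression model (5.22) by imposing the second-order condition (3.14) on the function $\ell$.

Using the inverse probability integral transform, (5.33) and (3.14), one easily obtains for $j = 1,\ldots,k$,

$$X_{n-j+1,n}-X_{n-k,n} \stackrel{D}{=} U(U_{k+1,n}^{-1}V_{j,k}^{-1})-U(U_{k+1,n}^{-1}) \sim c_{n,k+1}\left[V_{j,k}^{-\gamma}\exp\left(a_{n,k+1}\frac{V_{j,k}^{-\rho}-1}{\rho}\right)-1\right]$$

as $k/n \to 0$, where $c_{n,k+1} := a(U_{k+1,n}^{-1})/d(U_{k+1,n}^{-1})$ and $a_{n,k+1} := c\,a_2(U_{k+1,n}^{-1})$. Taking a log-ratio of spacings results in

$$\log\frac{X_{n-j+1,n}-X_{n-k,n}}{X_{n-j,n}-X_{n-k,n}} \sim \log\frac{V_{j,k}^{-\gamma}\exp\left(a_{n,k+1}\frac{V_{j,k}^{-\rho}-1}{\rho}\right)-1}{V_{j+1,k}^{-\gamma}\exp\left(a_{n,k+1}\frac{V_{j+1,k}^{-\rho}-1}{\rho}\right)-1}, \qquad j = 1,\ldots,k-1.$$

We now apply the mean value theorem (with the same notation for $E^*_{j,k}$ and $V^*_{j,k}$ as in section 5.4) and the Rényi representation of standard exponential order statistics to the right-hand side of the above equation:

$$\begin{aligned}
\log\frac{V_{j,k}^{-\gamma}\exp\left(a_{n,k+1}\frac{V_{j,k}^{-\rho}-1}{\rho}\right)-1}{V_{j+1,k}^{-\gamma}\exp\left(a_{n,k+1}\frac{V_{j+1,k}^{-\rho}-1}{\rho}\right)-1}
&\stackrel{D}{=} \log\frac{\exp\left(\gamma E_{k-j+1,k}+a_{n,k+1}\frac{\exp(\rho E_{k-j+1,k})-1}{\rho}\right)-1}{\exp\left(\gamma E_{k-j,k}+a_{n,k+1}\frac{\exp(\rho E_{k-j,k})-1}{\rho}\right)-1}\\
&= (E_{k-j+1,k}-E_{k-j,k})\,\frac{\gamma+a_{n,k+1}\exp(\rho E^*_{j,k})}{1-\exp\left(-\gamma E^*_{j,k}-a_{n,k+1}\frac{\exp(\rho E^*_{j,k})-1}{\rho}\right)}\\
&\stackrel{D}{=} \frac{E_j}{j}\,\frac{\gamma+a_{n,k+1}(V^*_{j,k})^{-\rho}}{1-(V^*_{j,k})^{\gamma}\exp\left(-a_{n,k+1}\frac{(V^*_{j,k})^{-\rho}-1}{\rho}\right)}
\end{aligned}$$

and hence, after replacing $V^*_{j,k}$ by $j/(k+1)$, we obtain the following approximate representation for log-ratios of spacings

$$j\log\frac{X_{n-j+1,n}-X_{n-k,n}}{X_{n-j,n}-X_{n-k,n}} \sim \frac{\gamma+a_{n,k+1}\left(\frac{j}{k+1}\right)^{-\rho}}{1-\left(\frac{j}{k+1}\right)^{\gamma}\exp\left(-a_{n,k+1}\frac{\left(\frac{j}{k+1}\right)^{-\rho}-1}{\rho}\right)}\,E_j, \qquad j = 1,\ldots,k-1. \tag{5.34}$$

Note that if $a_{n,k+1} = 0$ model (5.34) reduces to (5.22). The parameters $\gamma$, $a_{n,k+1}$ and $\rho$ of model (5.34) can be jointly estimated with the ML method. We denote these ML estimators by $\hat\gamma^{RMB}_{k}$, $\hat a^{RMB}_{n,k+1}$ and $\hat\rho^{RMB}_{k}$. Since (5.34) is invariant with respect to shifts and rescalings of the data, all estimators based on (5.34) share the same invariance property.

5.7.2 Extreme quantiles and small exceedance probabilities

We continue the discussion of the exponential regression model approach of Matthys and Beirlant (2003). Using (5.33) together with condition (3.14) on $\ell$ one easily obtains the following asymptotic representation

$$U\left(\frac{1}{p}\right)-X_{n-k,n} \sim c_{n,k+1}\left[\left(\frac{k+1}{(n+1)p}\right)^{\gamma}\exp\left(a_{n,k+1}\frac{\left(\frac{k+1}{(n+1)p}\right)^{\rho}-1}{\rho}\right)-1\right], \tag{5.35}$$

where, as before, $c_{n,k+1} := a(U_{k+1,n}^{-1})/d(U_{k+1,n}^{-1})$ and $a_{n,k+1} := c\,a_2(U_{k+1,n}^{-1})$.

Estimating $\gamma$, $a_{n,k+1}$ and $\rho$ by respectively $\hat\gamma^{RMB}_{k}$, $\hat a^{RMB}_{n,k+1}$ and $\hat\rho^{RMB}_{k}$, $c_{n,k+1}$ can be estimated as

$$\hat c_{n,k+1} = \frac{1}{k}\sum_{j=1}^{k}\frac{j\,(X_{n-j+1,n}-X_{n-j,n})\left(\frac{j}{k+1}\right)^{\hat\gamma^{RMB}_{k}}}{\left[\hat\gamma^{RMB}_{k}+\hat a^{RMB}_{n,k+1}\left(\frac{j}{k+1}\right)^{-\hat\rho^{RMB}_{k}}\right]\exp\left(\hat a^{RMB}_{n,k+1}\frac{\left(\frac{j}{k+1}\right)^{-\hat\rho^{RMB}_{k}}-1}{\hat\rho^{RMB}_{k}}\right)}.$$

Finally, replacing $\gamma$, $\rho$, $a_{n,k+1}$ and $c_{n,k+1}$ by their respective estimators in (5.35) leads to a bias corrected estimator $\hat U^{RMB}_{k,n}(1/p)$. As with the refined ML estimator $\hat\gamma^{RMB}_{k}$ for $\gamma$, $\hat U^{RMB}_{k,n}(1/p)$ usually succeeds well in reducing the bias of $\hat U^{RMA}_{k,n}(1/p)$. On the other hand, it has a higher variance and hence is often less optimal in mean squared error (MSE) sense. Note that for a fixed high value of $U(1/p)$, the above equation can be solved numerically for $p$, yielding an exceedance probability estimate.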


5.8 Adaptive Selection of the Tail Sample Fraction 

As we know from the previous chapter, successful practical application of EVI estimators crucially depends on the selection of a good or possibly optimal $k$-value. We continue the discussion along the lines of section 5.6. In that section, we provided the theoretical optimal $k$-values for some of the better-known EVI estimators. The optimal values clearly depend on the EVI $\gamma$ and the parameters $b(n)$ and $\rho$ describing the second-order tail behaviour. Replacing these unknown parameters by their respective estimates then yields an estimate for $k_{opt}$. To this aim, we take a closer look at the regression through the $k$ ultimate points on the generalized quantile plot.

On the basis of condition $(C_\gamma)$, $F \in \mathcal{D}(G_\gamma)$ implies $UH \in \mathcal{R}_\gamma$ and hence the generalized quantile plot will be ultimately linear. Further, the slope of this ultimate linear part approximates $\gamma$. Under the Hall (1982) model,

$$U(x) = \begin{cases} Cx^{\gamma}\left(1+Dx^{\rho}(1+o(1))\right) \quad (x \to \infty) & \text{if } \gamma > 0,\\[0.5ex] x_+ - Cx^{\gamma}\left(1+Dx^{\rho}(1+o(1))\right) \quad (x \to \infty) & \text{if } \gamma < 0,\end{cases}$$

with $C > 0$ and $D \in \mathbb{R}$, Beirlant et al. (2002b) derived the following approximate model:

$$Z_j := (j+1)\log\frac{UH_{j,n}}{UH_{j+1,n}} = \gamma + b(n/k)\left(\frac{j}{k}\right)^{-\rho} + \varepsilon_j, \qquad j = 1,\ldots,k, \tag{5.36}$$

where $\varepsilon_j$ are considered as zero-centred error terms. Ignoring the second term in the right-hand side of (5.36) results in the reduced model

$$(j+1)\log\frac{UH_{j,n}}{UH_{j+1,n}} \sim \gamma + \varepsilon_j, \qquad j = 1,\ldots,k,$$

for which $\hat\gamma^{H}_{k,n}$ is the least-squares estimator. The full model (5.36) can be exploited directly to propose an estimator for $\gamma$ using a least-squares method, thereby replacing $\rho$ by an estimator $\hat\rho$. Beirlant et al. (2002c) propose

$$\hat\rho_{k,\lambda,n} = -\frac{1}{\log\lambda}\,\log\left|\frac{\hat\gamma^{H}_{\lfloor\lambda^2k\rfloor,n}-\hat\gamma^{H}_{\lfloor\lambda k\rfloor,n}}{\hat\gamma^{H}_{\lfloor\lambda k\rfloor,n}-\hat\gamma^{H}_{k,n}}\right|, \qquad \lambda \in (0,1),$$

as an appropriate choice, which is a consistent estimator when $k$ is chosen such that $\sqrt{k}\,b(n/k) \to \infty$. For practical diagnostic purposes it can be sufficient to replace $\rho$ by a canonical choice such as $-1$. For a given $\rho$-value, $\gamma$ and $b(n/k)$ can be estimated using least squares, resulting in

$$\hat\gamma_{LS,k}(\rho) = \bar Z_k - \frac{\hat b_{LS,k}(\rho)}{1-\rho},$$

$$\hat b_{LS,k}(\rho) = \frac{(1-\rho)^2(1-2\rho)}{\rho^2}\,\frac{1}{k}\sum_{j=1}^{k}\left(\left(\frac{j}{k}\right)^{-\rho}-\frac{1}{1-\rho}\right)Z_j,$$

with $\bar Z_k = \frac{1}{k}\sum_{j=1}^{k}Z_j$.
These least-squares estimators can now be used to estimate the $k_{opt}$ values as given in section 5.6 in an adaptive way. For brevity, we consider the estimator $\hat\gamma^{H}_{k,n}$ in case $\gamma > 0$. The procedure can, however, be applied without any problem to the other estimators considered in that section. Because of the fact that $a_2 \in \mathcal{R}_\rho$, the AMSE of the simple estimator $\hat\gamma^{H}_{k,n}$ is minimal for

$$k_{opt} \sim \left[b(n/k_0)\right]^{-2/(1-2\rho)}\,k_0^{-2\rho/(1-2\rho)}\left(\frac{(1+\gamma^2)(1-\rho)^2}{-2\rho}\right)^{1/(1-2\rho)} \tag{5.37}$$

for any secondary value $k_0 \in \{1,\ldots,n-1\}$. Plugging consistent estimators for $\gamma$, $b(n/k_0)$ and $\rho$, for instance, the least-squares estimators, into (5.37) yields an estimator for $k_{opt}$. In this way, for each value of $k_0$, an estimator of $k_{opt}$ is obtained.
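The adaptive procedure can be sketched in Python as follows. The sketch rests on several assumptions that should be checked against the text: the regression weights in the least-squares step are taken as $(j/k)^{-\rho}$, the least-squares formulas are the large-$k$ approximations $\hat\gamma_{LS,k}(\rho) = \bar Z_k - \hat b_{LS,k}(\rho)/(1-\rho)$ and $\hat b_{LS,k}(\rho) = (1-\rho)^2(1-2\rho)\rho^{-2}\,k^{-1}\sum_j\{(j/k)^{-\rho} - 1/(1-\rho)\}Z_j$, and $\rho < 0$ is supplied by the user (for example the canonical choice $-1$). The function names are hypothetical.

```python
def least_squares_estimates(z, rho):
    """Least-squares estimates of gamma and b(n/k) from Z_1..Z_k for a given rho < 0."""
    k = len(z)
    z_bar = sum(z) / k
    weighted = sum(((j / k) ** (-rho) - 1.0 / (1.0 - rho)) * z_j
                   for j, z_j in enumerate(z, start=1)) / k
    b_hat = (1.0 - rho) ** 2 * (1.0 - 2.0 * rho) / rho ** 2 * weighted
    gamma_hat = z_bar - b_hat / (1.0 - rho)
    return gamma_hat, b_hat


def adaptive_k_opt(gamma_hat, b_hat, rho, k0):
    """Plug-in version of (5.37), with b_hat an estimate of b(n/k0)."""
    exponent = 1.0 / (1.0 - 2.0 * rho)
    constant = (1.0 + gamma_hat ** 2) * (1.0 - rho) ** 2 / (-2.0 * rho)
    # (b_hat ** 2) ** (-exponent) equals |b_hat| ** (-2 / (1 - 2 rho)) and avoids
    # raising a negative number to a fractional power
    return (b_hat ** 2) ** (-exponent) * k0 ** (-2.0 * rho * exponent) * constant ** exponent
```

In practice one would compute `least_squares_estimates` for a range of secondary values `k0`, pass the results to `adaptive_k_opt`, and inspect the resulting sequence of estimated optimal values, for instance through its median.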
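5.9 Appendices

5.9.1 Information matrix for the GEV

Consider a GEV-distributed random variable $X$. Let $\theta' = (\sigma,\gamma,\mu)$ and denote by $g$ the GEV density function:

$$g(x;\sigma,\gamma,\mu) = \frac{1}{\sigma}\left(1+\gamma\,\frac{x-\mu}{\sigma}\right)^{-\frac{1}{\gamma}-1}\exp\left(-\left(1+\gamma\,\frac{x-\mu}{\sigma}\right)^{-\frac{1}{\gamma}}\right).$$

Then the information matrix

$$I(\theta) = -E\left[\frac{\partial^2\log g(X;\theta)}{\partial\theta\,\partial\theta'}\right]$$

has as generic elements

$$I_{1,1}(\theta) = \frac{1}{\sigma^2\gamma^2}\left(1-2\Gamma(2+\gamma)+p\right),$$

$$I_{1,2}(\theta) = -\frac{1}{\sigma\gamma^2}\left(1-\gamma^*-q+\frac{1-\Gamma(2+\gamma)}{\gamma}+\frac{p}{\gamma}\right),$$

$$I_{1,3}(\theta) = -\frac{1}{\sigma^2\gamma}\left(p-\Gamma(2+\gamma)\right),$$

$$I_{2,2}(\theta) = \frac{1}{\gamma^2}\left(\frac{\pi^2}{6}+\left(1-\gamma^*+\frac{1}{\gamma}\right)^2-\frac{2q}{\gamma}+\frac{p}{\gamma^2}\right),$$

$$I_{2,3}(\theta) = -\frac{1}{\sigma\gamma}\left(q-\frac{p}{\gamma}\right),$$

$$I_{3,3}(\theta) = \frac{p}{\sigma^2},$$

where $\gamma^* = 0.5772157$ is Euler's constant,

$$p = (1+\gamma)^2\,\Gamma(1+2\gamma),$$

$$q = \Gamma(2+\gamma)\left(\psi(1+\gamma)+\frac{1+\gamma}{\gamma}\right),$$

with $\psi(x) = d\log\Gamma(x)/dx$.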


5.9.2 Point processes 

The peaks-over-threshold (POT) method in section 5.3.1 relies on a parametric model for a certain point process. Moreover, point process techniques are useful in inference on multivariate extremes or extremes of time series. For the reader's convenience, we include here a very brief and informal introduction on point processes. For a proper study of the subject, the reader should consult books such as Resnick (1987) or Snyder and Miller (1991), among others.

Let $\{X_i : i \in I\}$ represent the locations of points, indexed by a set $I$, occurring randomly in a state space $S$. A point process $N$ counts the number of points in regions of $S$:

$$N(A) = \sum_{i\in I}1(X_i \in A), \qquad A \subset S.$$

The expected number of points in a set $A$ is given by the intensity measure $\Lambda(A) = E[N(A)]$. If the state space $S$ is Euclidean space or a subset thereof and if the intensity measure $\Lambda$ has a density function $\lambda : S \to [0,\infty)$, that is, if $\Lambda(A) = \int_A\lambda(x)\,dx$, then $\lambda$ is called the intensity function of the process.

The most common type of point processes are Poisson processes. A point process $N$ with intensity measure $\Lambda$ is said to be a Poisson process if the following two conditions are fulfilled: (i) for each set $A$ such that $\Lambda(A) < \infty$, $N(A)$ is a Poisson random variable with mean $\Lambda(A)$; (ii) for all positive integers $k$ and all disjoint sets $A_1,\ldots,A_k$, the random variables $N(A_1),\ldots,N(A_k)$ are independent. A Poisson process on a (subset of) Euclidean space is called homogeneous if its intensity function $\lambda$ is constant, $\lambda(x) \equiv \lambda$, and inhomogeneous otherwise.

More generally, a marked point process counts for each point $X_i$ a quantity $Y_i$ and has representation

$$N(A) = \sum_{i\in I}Y_i\,1(X_i \in A),$$

the marks $\{Y_i\}_{i\in I}$ being identically distributed. Observe that a marked point process with all marks equal to unity is simply a point process.

A compound Poisson process is a marked point process for which the points $X_i$ occur according to a Poisson process independently of the marks $Y_i$, which are themselves independent and identically distributed. We shall denote by $CP(\lambda,\pi)$ a compound Poisson process with intensity function $\lambda$ and mark distribution $\pi$.

A sequence of (marked) point processes $N_n$ on a state space $S$ is said to converge in distribution to a (marked) point process $N$, notation $N_n \stackrel{D}{\to} N$, if for each positive integer $k$ and all sets $A_1,\ldots,A_k$, the vector $(N_n(A_i))_{i=1}^{k}$ converges in distribution to the vector $(N(A_i))_{i=1}^{k}$. A typical way to establish convergence of point processes is via convergence of Laplace functionals.

If the intensity function, $\lambda$, of a Poisson process $N$ depends on an unknown parameter vector, $\theta$, then we can estimate $\theta$ by ML. In order to construct the likelihood, we first have to choose a region $A$ in the sample space such that $\Lambda(A;\theta) < \infty$ for all $\theta$. A crucial property of a Poisson process is now that the points in a region $A$ conditionally on their number $N(A)$ are independent and identically distributed with common density $f(x;\theta) = \lambda(x;\theta)/\Lambda(A;\theta)$. Moreover, $N(A)$ has a Poisson distribution with mean $\Lambda(A;\theta)$. Therefore, if the points falling in $A$ of a realization of $N$ can be enumerated as $x_1,\ldots,x_m$, then the likelihood is

$$L(\theta) = \exp\{-\Lambda(A;\theta)\}\,\frac{\Lambda(A;\theta)^m}{m!}\prod_{i=1}^{m}\frac{\lambda(x_i;\theta)}{\Lambda(A;\theta)} \propto \exp\{-\Lambda(A;\theta)\}\prod_{i=1}^{m}\lambda(x_i;\theta).$$
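The Poisson process likelihood is easy to evaluate once the intensity function and its integral over the region $A$ are available. The following Python sketch, with hypothetical function names, computes the log-likelihood up to the additive constant $-\log m!$ and illustrates it on the simplest case, a homogeneous process on an interval of given length, for which the ML estimate of the rate is the number of points divided by the length.

```python
import math


def poisson_log_likelihood(points, intensity, total_intensity):
    """log L(theta) up to the constant -log(m!).

    points          : observed locations x_1, ..., x_m in the region A
    intensity       : function x -> lambda(x; theta)
    total_intensity : Lambda(A; theta), the integral of the intensity over A
    """
    return -total_intensity + sum(math.log(intensity(x)) for x in points)


def homogeneous_mle(points, length):
    """ML estimate of a constant intensity on an interval of the given length."""
    return len(points) / length


def homogeneous_log_likelihood(rate, points, length):
    """Log-likelihood of a homogeneous Poisson process with the given rate."""
    return poisson_log_likelihood(points, lambda x: rate, rate * length)
```

For an inhomogeneous model, only `intensity` and `total_intensity` need to be changed; the maximization over the parameter vector would then be carried out numerically.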


5.9.3 $GRV_2$ functions with $\rho < 0$

We restrict to cases where $a_2(x)$ is regularly varying with index $\rho < 0$ and with a slowly varying part being asymptotically equivalent to a constant. Then, without loss of generality, this constant can be set equal to 1.

From Vanroelen (2003), we obtain the following representations of $U$.

• $0 < -\rho < \gamma$: for $U \in GRV_2(\gamma,\rho;\,\ell_+x^{\gamma},a_2(x);\,0,A)$:

$$U(x) = \ell_+x^{\gamma}\left[\frac{1}{\gamma}+\frac{A}{\gamma+\rho}\,a_2(x)(1+o(1))\right],$$

• $\gamma = -\rho$: for $U \in GRV_2(\gamma,-\gamma;\,\ell_+x^{\gamma},x^{-\gamma}\ell_2(x);\,0,A)$ with $\ell_2$ some slowly varying function:

$$U(x) = \ell_+x^{\gamma}\left[\frac{1}{\gamma}+x^{-\gamma}L_2(x)\right]$$

with $L_2(x) = B + \int_1^x(A+o(1))\,\frac{\ell_2(t)}{t}\,dt + o(\ell_2(x))$ for some constant $B$,

• $0 < \gamma < -\rho$: for $U \in GRV_2(\gamma,\rho;\,\ell_+x^{\gamma},a_2(x);\,0,A)$:

$$U(x) = \ell_+x^{\gamma}\left[\frac{1}{\gamma}+Dx^{-\gamma}+\frac{A}{\gamma+\rho}\,a_2(x)(1+o(1))\right],$$

• $\gamma = 0$: for $U \in GRV_2(0,\rho;\,\ell_+,a_2(x);\,0,A)$:

$$U(x) = \ell_+\log x + D + \frac{A\ell_+}{\rho}\,a_2(x)(1+o(1)),$$

• $\gamma < 0$: for $U \in GRV_2(\gamma,\rho;\,\ell_+x^{\gamma},a_2(x);\,0,A)$:

$$U(x) = U(\infty) - \ell_+x^{\gamma}\left[\frac{1}{-\gamma}-\frac{A}{\gamma+\rho}\,a_2(x)(1+o(1))\right],$$

where $\ell_+ > 0$, $A \neq 0$, $D \in \mathbb{R}$.

Concerning $\log U$, the following results are available under these representations:

• If $0 < -\rho < \gamma$ then $\log U \in GRV_2\left(0,\rho;\,\gamma,a_2(x);\,0,\frac{\rho A}{\gamma+\rho}\right)$;

• If $\gamma = -\rho$ then $\log U \in GRV_2(0,-\gamma;\,\gamma,x^{-\gamma}L_2(x);\,0,-\gamma)$;

• If $0 < \gamma < -\rho$ then $\log U \in GRV_2(0,-\gamma;\,\gamma,x^{-\gamma};\,0,-\gamma D)$ if $D \neq 0$, and $\log U \in GRV_2\left(0,\rho;\,\gamma,a_2(x);\,0,\frac{\rho A}{\gamma+\rho}\right)$ if $D = 0$;

• If $\gamma = 0$ then $\log U \in GRV_2\left(0,0;\,\frac{1}{\log x},\frac{1}{\log x};\,-1,0\right)$;

• If $\gamma < \rho$ then $\log U \in GRV_2\left(\gamma,\rho;\,[U(\infty)]^{-1}\ell_+x^{\gamma},a_2(x);\,0,A\right)$;

• If $\rho < \gamma < 0$ then $\log U \in GRV_2\left(\gamma,\gamma;\,[U(\infty)]^{-1}\ell_+x^{\gamma},\ell_+x^{\gamma};\,0,-\frac{1}{\gamma U(\infty)}\right)$;

• If $\gamma = \rho$ then $\log U \in GRV_2\left(\gamma,\gamma;\,[U(\infty)]^{-1}\ell_+x^{\gamma},a_2(x);\,0,A-\frac{\ell_+}{\gamma U(\infty)}\right)$.


5.9.4 Asymptotic mean squared errors 

In the statement of our results, we will use the following notations: 


Ap[p-\-y{\-p)] 

(y+/))(l-p) 


a2(x) 


b(x)= 


(i+y): 


(l+K)^ 


- y L 2 (x) 


log Z ( 乂） 

Ap(l-K) / x 
(l-]/-p) a2 ( X ) 

_ y _ 


\-2y 


-2y f/(oo 广 

A( 1 -k) 


U(oo) 


if0<—p<yorif0<y< —p 
with D = 0, 

a y = -p, 

if 0 < y < —p with D ^ 0, 
if / = 0, 
if y < p ， 
if p < y < 0, 


if y = p, 


and 


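$$\tilde\rho = \begin{cases} -\gamma & \text{if } 0 < \gamma < -\rho \text{ with } D \neq 0,\\ \rho & \text{if } 0 < -\rho < \gamma \text{ or if } 0 < \gamma < -\rho \text{ with } D = 0,\\ 0 & \text{if } \gamma = 0,\\ \rho & \text{if } \gamma < \rho,\\ \gamma & \text{if } \rho < \gamma < 0.\end{cases}$$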


Below, we derive the AMSEs of the different estimators. 


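• For the estimator $\hat\gamma^{H}_{k,n}$:

$$AMSE(\hat\gamma^{H}_{k,n}) = \begin{cases}\dfrac{1+\gamma^2}{k}+\left(\dfrac{b(n/k)}{1-\tilde\rho}\right)^2, & \text{if } \gamma > 0,\\[1.5ex] \dfrac{(1-\gamma)(1+\gamma+2\gamma^2)}{(1-2\gamma)k}+\left(\dfrac{b(n/k)}{1-\tilde\rho}\right)^2, & \text{if } \gamma < 0.\end{cases}$$

• For the estimator $\hat\gamma^{Z}_{k,n}$: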


AMSE(y k z n )= 


2(l+y+y 2 ) 

k 


( 1 -/ 0 ) 



2(l-y)(l+2y+y 2 -2y 3 ) , ( 1 , /n\\ 2 

(l-2y)(l-y)k r 


if y > o, 
if y < 0. 


• The AMSE of the moment estimator $M_{k,n}$ can be found from Dekkers et al. (1989):


AMSE(M Kn )= 


¥ + ( t % M !)) 2 
i + Mt)- 


(l-y) 2 (l-2y)(6y 2 -y+l) 

(l-3y)(l-4y)k 

+ {l-2y-p b (l)) ’ 

(l-y) 2 (l-2y)(6y 2 -y+l) 

(l-3y)(l-4y)k 

+ (jA b (i)) - 

(l-)/) 2 (l-2y)(6y 2 -y+l) 

{\-3 Y )(l-4 r )k 

, ( (i_2 y ) , / „\\ 

l (1 - 柳 _ 3y) 


if / > 0, 
if / = 0 ， 
if y < p ， 

if p < y < 0, 


if y = p. 


• Drees et al. (2002) stated the following expressions for the AMSE for the 
ML estimator based on a generalized Pareto fit: 


AMSE 


( e ) 


(i+y ) 2 

k 


p(y+l)A 

(l-p)(l-p+y) 


Cl2 


⑴: 


if y > —j, p < 0. 


5.9.5 AMSE optimal $k$-values

Below, the optimal values of $k$ that minimize the different expressions of the AMSEs are given.

• For the estimator $\hat\gamma^{H}_{k,n}$:


kopt = 


\[b(n)r 5l \l+o{\)), 


(l +y+ 2y 2 )(l-p) 2 (l-y) 

(-2p)(l-2y) 


1 

T^Tp 


[b(n)] 


2 

T^2p 


if / > 0, 
if / = 0, 

if / < 0, 
• For the estimator $\hat\gamma^{Z}_{k,n}$:


2(l-p) 4 (l+y+K 2 ) 


扯⑻ r 5/2 (i + o(i ))， 


2(l-p) 4 (l-3/)(l+2y+y 2 -2 K 3 ) 

(-2p)(l-2 K )(l-K) 


if K > 0, 
if K = 0, 


if K < 0, 


• For the estimator $M_{k,n}$:


(l + y 2 )(l-yQ) 2 

~P 


i[/ 7 (n)r 3 / 2 (l+o(l)) 


(l-y) 2 (l-2y-p) 2 (6y 2 -y-{-l) 

(-2 / o)(1-2k)(1-3k)(1-4 K ) 

\ / 

P 2 (l-y) 4 (6y 2 -y+l) \ 1 

(-2 y o)(l-2j/)(l-3y)(l-4y )) 


y) 4 (6y 2 -y+l) ( 和 1 -〆)- ifej V 、 1 -好 


(-2p)(l-2y)(l-4y) 


if y > 0, 
if K = 0, 


if y < p. 


if p < y < 0, 


if }/ = p, 


for the estimator y ^： 

^ :2)X):2 +1 ) 2 ) [«2 ⑻ ]- 為 if K > 一 ^ P < 0. 


The corresponding minimal AMSE values are then given by 


• For the estimator $\hat\gamma^{H}_{k_{opt},n}$:


AMSE(hl，J 


广剛 _(1-2 孙 if/>0, 

b 2 (n), if y = 0, 

(- 2 pra- 2 yy \T^ if v< 0 

Y)~P{\-m+Y^Y 2 )~ p ) y ’ 


if y = o, 


{-2~p)P{\-2 Y ) p 

,(l-K)^(l-iO)(l+K+2y2)P 

[/?(«)]"^(1 - 2 p ), 


if K < 0, 
• For the estimator $\hat\gamma^{Z}_{k_{opt},n}$:


AMSE(9 k Z op ,，„) 


(~2py 


2 


\2~P{\-P) 2 (\+Y+Y 2 )~ p ) 

[Z7(n)]iAp(l -2p), 

办 2 ⑻， 

( {-2~p)~P{\-2 Y )~ p {\-Y)~ p 、 

y 2P(l-p) 2 (l-y)P(l+2y+y 2 -2y 3 )P ^ 

[Z? ⑻ ] 為(1 -2p), 


2 


if y > o, 

if y = o, 
if y < o, 


• For the estimator $M_{k_{opt},n}$:

AMSE(Y k M op „ n )= 


((~2py 

V( 1 +kV( 1 -P) 

bin), 


) 1_2p [b(n)] 1 ~ 2 p 


d-2p), 


(-2p)^(l-y)- 2 ^(l-2y) 1 ~^(l-3K)^(l-4y)^ 

(l-2y-p)(6y 2 -y-\-l)P 


2 


[b(n)]~(l - 2p), 

{-2~p)~P{\-2>y)P{\-A Y )' p 

y(l-y) 1+2 ^(l-2y)^- 1 (6K 2 -K+lK 

、 / 

[Z?(n)]^(l - 2p), 


2 

T^lp 


{-2p)P (l-4y)^ M^-rr 


(l_ K )l+2p (1 _ 2y )P-l(6 K 2_ K+1) p A(1 _ y) _ 

[b(n)]~Ap(l - 2p), 


t7(oo) / 


2 


if K > 0, 
if K = 0, 

if y < p, 


if p < / < 0, 


if )/ = p ， 


for the estimator 9^ n : 


AMSE(y^ n ) 


气⑻ ] 忐 (1 


(1 — p)(i - p + y) 


2p). 



6 

CASE STUDIES 

6.1 The Condroz Data 

In this case study, we will concentrate on the Ca content (expressed in mg/100 g of 
dry soil) of soil samples originating from a particular city (NIS code 61072) in the 
Condroz region. Although the Ca content is clearly dependent on other factors such 
as pH level, we ignore this covariate information for the moment and study the 
univariate properties. Figure 6.1(a) displays a histogram of the Ca measurements 
of soil samples from this city. When the main interest is in tail modelling, the 
exponential quantile plot and mean excess plot (which can be considered as a 
derivative plot of the former) form a good starting point. These plots are given in 
Figures 6.1(b), (c) and (d). The convex shape of the exponential quantile plot and 
the increasing behaviour of the mean excess plots in the largest observations give 
evidence of the HTE nature of the tail of the Ca content distribution. 

To assess the fit of a Pareto-type model, a Pareto quantile plot was constructed 
for these data, given in Figure 6.2. Except for the last seven points, the Pareto 
quantile plot is linear in the larger observations. The very largest observations that 
do not follow the ultimate linearity of the Pareto quantile plot are suspect with 
respect to the Pareto-type model. However, in this analysis, we conditioned on the 
city but we did not take into account the possible link with other covariates such 
as pH level. In fact, as can be seen from the Ca versus pH scatterplot given in 
Figure 6.3, both variables appear to be dependent. Moreover, extreme Ca measurements
tend to occur more often at the higher pH levels, indicating the need for a
tail analysis conditional on the covariate pH. We will return to this conditioning 
issue later on in this case study. 

As explained in Chapter 2, the ultimate linearity of the Pareto quantile plot can be exploited to construct estimators for the tail index $\gamma$. In Figure 6.4(a), we show the results of the maximum likelihood procedure applied to the exponential regression model, $\hat\gamma^{+}_{ML}$ (solid line), and the Hill estimates, $H_{k,n}$ (broken line), as a function of $k$. The maximum likelihood estimates $\hat\gamma^{+}_{ML}$ are stable around the value 0.26 for $k$ values between 450 and 1500, whereas the Hill estimates show this stability only for $k$ values between 250 and 500. However, the maximum likelihood estimator seems to be more sensitive to the seven observations considered as suspect with respect to the Pareto-type model: around $k = 400$, there is an abrupt shift of the optimum found by the ML algorithm. The selection of an optimal $k$ value for the Hill estimator is illustrated in Figure 6.4(b) where we plot the estimated asymptotic mean squared error as a function of $k$. The minimum is reached at $k_{opt} = 402$. The vertical reference line in Figure 6.4(a) and (b) represents this estimated optimal $k$-value. Note that beyond $k_{opt}$, the Hill estimator diverges from the maximum likelihood estimator. In Goegebeur et al. (2004), Burr regression models were fitted to these calcium measurements, thereby taking the pH level as covariate. Their analysis identified six points as suspect with respect to the conditional Burr model. Figure 6.4(c) shows $\hat\gamma^{+}_{ML}$ and $H_{k,n}$, obtained after deletion of these suspect points as a function of $k$. The corresponding estimated AMSE of $H_{k,n}$ is given in Figure 6.4(d); here the optimum is reached at $k = 391$. Finally, the bias-corrected estimator (solid line) and the Weissman estimator (broken line) for $Q(0.9995)$ are given as a function of $k$ in Figure 6.5.

Figure 6.1 (a) Histogram, (b) exponential quantile plot, (c) $e_{k,n}$ versus $k$ and (d) $e_{k,n}$ versus $x_{n-k,n}$ for the Ca measurements.

Figure 6.2 Pareto quantile plot of the Ca content measurements.

Figure 6.3 Scatterplot of Ca versus pH.

Figure 6.5 Bias-corrected quantile estimator (solid line), Weissman estimator (broken line) and $\hat q_{POT}$ (broken-dotted line) as a function of $k$ for $p = 0.0005$.

We now analyse the data using the extreme value techniques developed for
the general case $\gamma \in \mathbb{R}$. Figure 6.6(a) gives the generalized quantile plot
$\left(\log\frac{n+1}{k+1}, \log UH_{k,n}\right)$, $k = 1, \ldots, n-1$, for the Ca measurements. The
ultimately linear and increasing appearance of the points on this generalized quantile
plot again gives evidence in favour of a Pareto-type model. Further, following the
discussion given in Chapter 5, the slope of the ultimate linear part of this plot can
again be used to construct estimators for $\gamma$. The generalized Hill estimator and the
Zipf estimator, both slope estimators, exploit this ultimate linearity and are plotted
in Figure 6.6(b). Note that the estimator drawn as a broken line gives somewhat higher
values for $\gamma$ than the one drawn as a solid line. Figure 6.6(b) also shows the moment
estimates $M_{k,n}$ (dotted line) and the GP maximum likelihood estimates (broken-dotted
line). Unlike the plot of $H_{k,n}$ and $\hat\gamma_{ML}$, Figure 6.6(b) does not really show a
stable region, making inference about the value of $\gamma$ more difficult. Finally, the
estimation of extreme quantiles on the basis of the GP distribution is illustrated in
Figure 6.5, where, next to the bias-corrected and Weissman estimates, we also show the
GP estimate for $Q(0.9995)$ (broken-dotted line).

Figure 6.7 Pareto quantile plots of the calcium measurements at (a) pH = 6, (b)
pH = 6.5 and (c) pH = 7.


We now return to the conditioning issue. Although the extreme value methods
especially developed for regression problems will be described extensively in
Chapter 7, some straightforward analyses can be performed on the basis of the
univariate extreme value methodology discussed so far. First, the fit of a Pareto-type
model to the conditional distribution of the dependent variable, given the
covariate(s), can be visually assessed by inspection of Pareto quantile plots of the
response measurements within narrow bins in the covariate space. Of course, such a
procedure based on binning can only be expected to perform well when the conditional
distribution of the dependent variable varies smoothly as a function of the
covariate(s). This is illustrated in Figure 6.7. Given the discrete nature of the
covariate pH, binning is not really necessary here, and all response observations at a
particular pH level can be used. The ultimate linearity of the Pareto quantile plots
indicates that Pareto-type models provide appropriate fits to the conditional
distributions of calcium, given the pH level. Note, however, that the largest
observation in Figure 6.7(b) does not follow the linear pattern set by the other large
observations. So, even compared to a heavy-tailed model, this point is suspicious and
requires special attention.

The tail heaviness of the response distribution conditional on the covariate
information can be estimated similarly, using all response observations within a
narrow bin in the covariate space. Figure 6.8 shows the different $\gamma$ estimates as
functions of $k$ for the calcium measurements at pH = 7; see also the Pareto quantile
plot in Figure 6.7(c). Note especially that, compared to the other tail index
estimators, $\hat\gamma_{ML}$ is stable over the whole set of $k$ values.

Figure 6.8 Conditional tail index estimation for the Condroz data: (a) $\hat\gamma_{ML}$ (solid
line) and $H_{k,n}$ (broken line) and (b) the generalized Hill and Zipf estimates (solid
and broken lines), the GP maximum likelihood estimates (broken-dotted line) and the
moment estimates $M_{k,n}$ (dotted line) as a function of $k$, for $k = 5, \ldots, 203$, using
the calcium measurements at pH = 7.

6.2 The Secura Belgian Re Data

The Secura Belgian Re data set contains automobile claims from 1988 until 2001
that are at least as large as 1,200,000 Euro. The original claim numbers were
corrected for, among other things, inflation. This data set contains $n = 371$
observations and is depicted in Figure 6.9. The ultimate goal of this case study is to
provide the participating reinsurance companies with an objective statistical analysis
in order to assist in the pricing of the unlimited excess-loss layer above an
operational priority $R$. The analysis performed here is based on the methodology
described in Beirlant et al. (2001).

In an excess-of-loss (XL) reinsurance contract, the reinsurer pays for the claim
amount in excess of a given limit. Formally, let $X$ denote the claim size; then, under
an XL reinsurance contract with retention level $R$, the intervention of the reinsurer
concerns the random amount $(X - R)_+$. Hence, the net premium $\Pi(R)$ is given by

$$\Pi(R) = E\big((X - R)_+\big) = \int_R^\infty (1 - F(y))\,dy.$$

An important ingredient for establishing the net premium is the mean excess
function. Indeed, since

$$e(R) = E(X - R \mid X > R) = \frac{\int_R^\infty (1 - F(y))\,dy}{1 - F(R)},$$

we have

$$\Pi(R) = e(R)\,\bar F(R),$$

with $\bar F = 1 - F$ denoting the survival function.

Figure 6.9 Secura data: claim sizes as a function of the year of occurrence.

Figure 6.10 (a) Exponential quantile plot of claim sizes, (b) $e_{k,n}$ versus $k$ and (c)
$e_{k,n}$ versus $x_{n-k,n}$.



Special emphasis will be put on the level R = 5,000,000 Euro, which is the priority
used in practice up to 2003. Note that only 12 observations are larger than
that level. In order to estimate $\Pi(R)$, several possibilities are at our disposal:
purely non-parametric methods, semi-parametric methods given by extreme value
techniques, and fully parametric models where the emphasis lies in trying to model
the whole outcome set from 1,200,000 Euro. In contrast to the latter, extreme value
methods try to fit only the tail of the distribution, from an appropriate
(statistical) threshold. Next to the estimation of a net premium, one also needs to
estimate the probability for a claim to fall in the layer above $R$.

6.2.1 The non-parametric approach 

Given the importance of the mean excess function for premium calculations, we
examine the exponential quantile and mean excess plots first. These are given in
Figure 6.10. From the exponential quantile plot, a point of inflection with different
slopes to the left and to the right can be detected. This becomes even more apparent
in the mean excess plot (Figure 6.10(b) and (c)): beyond 2,500,000 the rather
horizontal behaviour changes into a positive slope.

Of course, the simplest way to estimate the net premium $\Pi(R)$ is given by

$$\hat\Pi(R) = \frac{1}{n}\sum_{i=1}^{n}(X_i - R)_+. \qquad (6.1)$$

Figure 6.11 Pareto quantile plot.

Table 6.1 Non-parametric, Hill and GP-based estimates for $\Pi(R)$.

R             Non-parametric (6.1)    Hill (6.4)    GP (6.5)
3,000,000     161,728.1               163,367.4     166,619.6
3,500,000     108,837.2               108,227.2     111,610.4
4,000,000     74,696.3                75,581.4      79,219.0
4,500,000     53,312.3                55,065.8      58,714.1
5,000,000     35,888.0                41,481.6      45,001.6
7,500,000     -                       13,944.5      16,393.3
10,000,000    -                       6,434.0       8,087.8

For large R values, this simple non-parametric estimator is of course doubtful 
because of the small number of observations on which it is effectively constructed. 
Table 6.1 gives the non-parametric premium estimator (6.1) for some values of R. 
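
As a minimal sketch of estimator (6.1), the following Python fragment computes the
empirical net premium together with the empirical mean excess and survival
probability, and illustrates the identity $\Pi(R) = e(R)\bar F(R)$. The function names
and the six claim amounts are made up for illustration only; they are not the Secura
data.

```python
def net_premium_empirical(claims, R):
    """Estimator (6.1): average of (X_i - R)_+ over all n claims."""
    return sum(max(x - R, 0.0) for x in claims) / len(claims)


def mean_excess_empirical(claims, R):
    """Empirical mean excess e(R): average excess among claims above R."""
    excesses = [x - R for x in claims if x > R]
    if not excesses:
        raise ValueError("no observations above R")
    return sum(excesses) / len(excesses)


def survival_empirical(claims, R):
    """Empirical exceedance probability: fraction of claims above R."""
    return sum(1 for x in claims if x > R) / len(claims)


# Hypothetical claim sizes (illustration only, not the Secura data).
claims = [1.3e6, 1.8e6, 2.4e6, 3.1e6, 5.6e6, 7.9e6]
R = 3.0e6

premium = net_premium_empirical(claims, R)
product = mean_excess_empirical(claims, R) * survival_empirical(claims, R)
```

Both `premium` and `product` are built from the same excesses, which is why the
estimator becomes unreliable as soon as only a handful of claims exceed $R$.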

6.2.2 Pareto-type modelling 

We now further investigate the tail behaviour of the claim size distribution. The
Pareto quantile plot given in Figure 6.11 is approximately linear in the largest
observations, indicating a good fit of the Pareto model to the tail of the claim size
distribution, though at the highest observations the trend flattens out. Again, the
tail index $\gamma$ can be estimated by measuring the slope of this ultimate linear part.
Figure 6.12(a) shows the exponential regression model-based maximum likelihood
estimates $\hat\gamma_{ML}$ and the Hill estimates $H_{k,n}$ as a function of $k$. The vertical
reference line at $k = 95$ represents the estimated optimal $k$-value, in the sense of
minimum asymptotic mean squared error, for the Hill estimator; see Figure 6.12(b). Note
that $\hat\gamma_{ML}$ and $H_{k,n}$ are almost indistinguishable for $k$-values between 50 and
100; beyond this interval, the bias of the Hill estimator becomes important while the
maximum likelihood estimator remains stable. Next to premium estimation, special
attention has to be paid to estimating the probability of exceeding the retention
R = 5,000,000. The Weissman estimate for P(X > 5,000,000) is given as a
function of $k$ in Figure 6.12(c). The horizontal reference line is the empirical
exceedance probability, that is, 12/371. Finally, Figure 6.12(d) contains the
bias-corrected estimate (solid line) and the Weissman estimate (broken line)
for the 0.999 quantile.
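
The Hill and Weissman estimates discussed above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: it uses the standard
textbook forms $\hat p = (k/n)(x/X_{n-k,n})^{-1/H_{k,n}}$ and
$\hat q = X_{n-k,n}\,(k/(np))^{H_{k,n}}$, which may differ from the exact variants used for
the figures (for instance in the plotting-position constants), and the sample is a
synthetic set of strict Pareto quantiles rather than the Secura claims.

```python
import math


def hill(sample, k):
    """Hill estimator H_{k,n} based on the k largest observations."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    log_threshold = math.log(xs[n - k - 1])  # log X_{n-k,n}
    return sum(math.log(x) - log_threshold for x in xs[n - k:]) / k


def weissman_exceedance(sample, k, x):
    """Weissman-type estimate of P(X > x) for x above X_{n-k,n}."""
    xs = sorted(sample)
    n = len(xs)
    threshold = xs[n - k - 1]
    if x < threshold:
        raise ValueError("x should not be below the threshold X_{n-k,n}")
    return (k / n) * (x / threshold) ** (-1.0 / hill(sample, k))


def weissman_quantile(sample, k, p):
    """Weissman-type estimate of the quantile Q(1 - p) for small p."""
    xs = sorted(sample)
    n = len(xs)
    return xs[n - k - 1] * (k / (n * p)) ** hill(sample, k)


# Synthetic sample: quantiles of a strict Pareto law with gamma = 0.3 (assumed).
gamma_true = 0.3
n = 1000
sample = [(1.0 - i / (n + 1)) ** (-gamma_true) for i in range(1, n + 1)]
k = 100

h = hill(sample, k)
q = weissman_quantile(sample, k, 0.001)
p_back = weissman_exceedance(sample, k, q)  # recovers p = 0.001 by construction
```

The two Weissman estimates are inverses of each other for a fixed $k$, which is why
plots of both as functions of $k$ carry essentially the same information.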

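We now turn to XL rating under the Pareto-type model. Recall that the basic formula
for calculating the net premium of an XL contract is

$$\Pi(R) = e(R)\,\bar F(R)$$

or, after dividing and multiplying the right-hand side by $R$,

$$\Pi(R) = \frac{e(R)}{R}\,R\,\bar F(R).$$

For Pareto-type models, $\bar F(x) = x^{-1/\gamma}\ell_F(x)$, with $\gamma < 1$, application of
Karamata's theorem (Theorem 2.3) gives

$$\frac{e(R)}{R} = \frac{\int_R^\infty u^{-1/\gamma}\ell_F(u)\,du}{R^{-1/\gamma+1}\,\ell_F(R)} \sim \frac{1}{\frac{1}{\gamma} - 1} \quad \text{as } R \to \infty,$$

so that

$$\Pi(R) \sim \frac{1}{\frac{1}{\gamma} - 1}\,R\,\bar F(R), \qquad R \to \infty.$$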

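When the priority $R$ is situated within the sample, that is, $R = X_{n-k,n}$, the net
premium $\Pi(R)$ can be estimated by

$$\hat\Pi(X_{n-k,n}) = \frac{1}{\frac{1}{\hat\gamma_k} - 1}\,X_{n-k,n}\,\frac{k}{n}, \qquad (6.3)$$

where $\hat\gamma_k$ denotes an estimator for $\gamma$ based on $k$ upper order statistics. If $R$ is
not fixed at one of the sample points, extreme value formulas can be used to estimate
$\bar F(R)$. Indeed, for Pareto-type models

$$P\left(\frac{X}{t} > x \,\Big|\, X > t\right) \sim x^{-1/\gamma}, \qquad t \to \infty,$$

so that

$$\bar F(tx) \sim \bar F(t)\,x^{-1/\gamma}$$

or, replacing $R = tx$,

$$\bar F(R) \sim \bar F(t)\left(\frac{R}{t}\right)^{-1/\gamma}, \qquad R > t.$$

Let $k$ denote an appropriate adaptive choice for the number of extreme order
statistics and set $t = X_{n-k,n}$; then $\Pi(R)$ can be estimated as

$$\hat\Pi(R) = \frac{1}{\frac{1}{H_{k,n}} - 1}\,\frac{k}{n}\,R\left(\frac{R}{X_{n-k,n}}\right)^{-1/H_{k,n}}. \qquad (6.4)$$

In Table 6.1, we illustrate the use of (6.4) for some values of $R$.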
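
A hedged sketch of the Hill-based premium estimate may help to fix ideas. The code
below assumes the form $\hat\Pi(R) = \frac{1}{1/H_{k,n} - 1}\,\frac{k}{n}\,R\,(R/X_{n-k,n})^{-1/H_{k,n}}$
for (6.4), that is, the Pareto-type premium approximation with $\bar F(X_{n-k,n})$
replaced by $k/n$; the function names are illustrative and the sample is synthetic, so
the numbers do not reproduce Table 6.1.

```python
import math


def hill(sample, k):
    """Hill estimator H_{k,n} based on the k largest observations."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    log_threshold = math.log(xs[n - k - 1])  # log X_{n-k,n}
    return sum(math.log(x) - log_threshold for x in xs[n - k:]) / k


def xl_premium_hill(sample, k, R):
    """Pareto-type net premium estimate for a priority R >= X_{n-k,n}."""
    xs = sorted(sample)
    n = len(xs)
    threshold = xs[n - k - 1]
    h = hill(sample, k)
    if h >= 1.0:
        raise ValueError("the net premium is infinite when gamma >= 1")
    if R < threshold:
        raise ValueError("R should not be below the threshold X_{n-k,n}")
    tail_prob = (k / n) * (R / threshold) ** (-1.0 / h)  # estimate of P(X > R)
    return tail_prob * R / (1.0 / h - 1.0)


# Synthetic claims: Pareto quantiles with gamma = 0.3, scaled to 1,200,000 (assumed).
n = 371
claims = [1.2e6 * (1.0 - i / (n + 1)) ** (-0.3) for i in range(1, n + 1)]
k = 95
threshold = sorted(claims)[n - k - 1]

premium_at_threshold = xl_premium_hill(claims, k, threshold)
premium_far_out = xl_premium_hill(claims, k, 2.0 * threshold)
```

At $R = X_{n-k,n}$ the estimate reduces to the within-sample formula (6.3), which gives
a quick consistency check on any implementation.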


6.2.3 Alternative extreme value methods 

In this section, we apply the extreme value methodology developed for the general
case $\gamma \in \mathbb{R}$ to this data set. As a first step, we compare several tail index
estimators; next we discuss net premium estimation.

First, consider the estimation of the tail parameter $\gamma$. In this respect, the
generalized quantile plot is a good starting point, as the pattern formed by the
$UH_{k,n}$ for small $k$ values gives an indication about the tail behaviour. For the
Secura data, the generalized quantile plot is given in Figure 6.13(a). The ultimately
linear and increasing behaviour indicates a heavy-tailed or Pareto-type model for the
claim size distribution, which is in line with the previous analysis. Figure 6.13(b)
shows the generalized Hill and Zipf estimates (solid and broken lines), the GP maximum
likelihood estimates (broken-dotted line) and the moment estimates $M_{k,n}$ as a function
of $k$. The estimation of P(X > 5,000,000) is illustrated in Figure 6.13(c) using the
Weissman estimates (solid line), the GP approach (broken line) and the empirical
exceedance probability (horizontal reference line). Finally, Figure 6.13(d) displays
some estimates for $Q(0.999)$.

Figure 6.13 (a) Generalized quantile plot, (b) generalized Hill and Zipf estimates
(solid and broken lines), GP maximum likelihood estimates (broken-dotted line) and
$M_{k,n}$ (dotted line) as a function of $k$, (c) POT (broken line), Weissman (solid line)
and empirical estimate for P(X > 5,000,000) and (d) estimates for $Q(0.999)$: POT-based
(broken-dotted line), bias-corrected (solid line) and Weissman (broken line).

We now consider net premium calculations on the basis of the GP tail fit. Since

$$\bar F(x) \sim \bar F(u)\left(1 + \gamma\,\frac{x - u}{\sigma}\right)^{-1/\gamma}, \qquad x > u,$$

we have that for $u$ sufficiently large, provided $\gamma < 1$,

$$e(R) \approx \frac{\sigma}{1 - \gamma}\left(1 + \gamma\,\frac{R - u}{\sigma}\right), \qquad R > u.$$

Setting $u = X_{n-k,n}$, where $k$ again denotes an appropriate choice for the number
of upper order statistics, $\Pi(R)$ can be estimated as

$$\hat\Pi(R) = \frac{k}{n}\,\frac{\hat\sigma}{1 - \hat\gamma}\left(1 + \hat\gamma\,\frac{R - X_{n-k,n}}{\hat\sigma}\right)^{1 - 1/\hat\gamma} \qquad (6.5)$$

for some estimates $\hat\sigma$ and $\hat\gamma$. In Table 6.1, we illustrate the use of (6.5) for
premium calculations using the Secura data. Note that the net premiums obtained with
the three estimators agree quite well. Compared to the estimates based on extreme value
methodology, the simple non-parametric estimator clearly does not yield sensible
results in case $R$ is outside (or near the end of) the data range.
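
The GP-based premium can be sketched as follows, assuming the form
$\hat\Pi(R) = \bar F(u)\,\frac{\sigma}{1-\gamma}\,(1 + \gamma (R-u)/\sigma)^{1 - 1/\gamma}$ that results from
multiplying the GP mean excess by the GP tail approximation. The threshold, scale,
shape and tail fraction used below are hypothetical placeholders, not fitted values
for the Secura data.

```python
import math


def xl_premium_gp(R, u, sigma, gamma, tail_fraction):
    """Net premium above priority R >= u under a GP tail fitted above u.

    tail_fraction is the estimated probability of exceeding u (e.g. k/n).
    """
    if sigma <= 0.0 or gamma >= 1.0:
        raise ValueError("need sigma > 0 and gamma < 1 for a finite premium")
    if R < u:
        raise ValueError("R should not be below the threshold u")
    if gamma == 0.0:
        # Exponential limit of the GP distribution.
        return tail_fraction * sigma * math.exp(-(R - u) / sigma)
    w = 1.0 + gamma * (R - u) / sigma
    if w <= 0.0:
        return 0.0  # R lies beyond the finite right endpoint (gamma < 0)
    return tail_fraction * sigma / (1.0 - gamma) * w ** (1.0 - 1.0 / gamma)


def gp_survival(x, u, sigma, gamma, tail_fraction):
    """GP approximation of P(X > x) for x >= u and gamma > 0."""
    return tail_fraction * (1.0 + gamma * (x - u) / sigma) ** (-1.0 / gamma)


# Hypothetical parameter values (illustration only).
u, sigma, gamma, tail_fraction = 2.5e6, 8.0e5, 0.3, 95 / 371

premium_at_u = xl_premium_gp(u, u, sigma, gamma, tail_fraction)
premium_5m = xl_premium_gp(5.0e6, u, sigma, gamma, tail_fraction)
```

Since $\Pi'(R) = -\bar F(R)$, a finite-difference derivative of `xl_premium_gp` can be
compared with `gp_survival` as a sanity check on the two formulas.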


6.2.4 Mixture modelling of claim sizes 

In the previous sections, we discussed XL rating using Pareto-type and GP
modelling of the tail of the claim size distributions. This resulted in estimators for
$\Pi(R)$ in case $R$ exceeds some sufficiently high threshold $x_{n-k,n}$. When trying to
estimate $\Pi(R)$ for values of $R$ smaller than the threshold $x_{n-k,n}$, one needs a global
statistical model describing the whole range of the possible claim outcomes. The
mean excess function given in Figure 6.10(c) suggests a mixture of an exponential
and a Pareto distribution:

$$\hat F_{\text{Exp-Par}}(u) = \begin{cases} 1 - \exp\big(-\lambda(u - 1{,}200{,}000)\big), & \text{if } 1{,}200{,}000 < u \le x_{n-k,n},\\[4pt] 1 - \exp\big(-\lambda(x_{n-k,n} - 1{,}200{,}000)\big)\left(\dfrac{u}{x_{n-k,n}}\right)^{-1/\gamma}, & \text{if } u > x_{n-k,n},\end{cases}$$

with $k = 95$ and $\lambda = 1/955{,}676.55$. In Figure 6.14(a), we plot the empirical
distribution function (solid line) together with the fitted Exp-Par mixture model
(broken line). As is clear from this plot, the Exp-Par mixture model describes the data
quite well. The fit of the Exp-Par mixture model can be further assessed by
transforming the data to the Exp(1) framework as follows:

$$E_i = -\log\big(1 - \hat F_{\text{Exp-Par}}(X_i)\big), \qquad i = 1, \ldots, n, \qquad (6.6)$$

followed by a visual inspection of the exponential quantile plot; see Figure 6.14(b).
The use of the Exp-Par model for premium computations is illustrated in Table 6.2.
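
To make the premium computation under such a spliced model concrete, the following
sketch integrates the survival function in closed form. Only the lower bound
1,200,000 and $\lambda = 1/955{,}676.55$ are taken from the text; the threshold value
`T_ASSUMED` and tail index `GAMMA_ASSUMED` are hypothetical stand-ins for $x_{n-k,n}$
and $\gamma$, so the output is not expected to reproduce Table 6.2.

```python
import math

U0 = 1_200_000.0            # lower bound of the claims (from the text)
LAM = 1.0 / 955_676.55      # exponential rate (from the text)
T_ASSUMED = 2_500_000.0     # hypothetical value for the threshold x_{n-k,n}
GAMMA_ASSUMED = 0.3         # hypothetical value for the tail index


def exp_par_survival(u, t=T_ASSUMED, gamma=GAMMA_ASSUMED, lam=LAM, u0=U0):
    """Survival function 1 - F(u): exponential body up to t, Pareto tail above."""
    if u <= u0:
        return 1.0
    if u <= t:
        return math.exp(-lam * (u - u0))
    return math.exp(-lam * (t - u0)) * (u / t) ** (-1.0 / gamma)


def exp_par_premium(R, t=T_ASSUMED, gamma=GAMMA_ASSUMED, lam=LAM, u0=U0):
    """Net premium: integral of the survival function from R to infinity."""
    if gamma >= 1.0:
        raise ValueError("the net premium is infinite when gamma >= 1")
    if R < u0:
        raise ValueError("R should not be below the lower bound of the claims")
    p_t = math.exp(-lam * (t - u0))           # probability of exceeding t
    lower = max(R, t)
    # Pareto part: integral of p_t (u/t)^(-1/gamma) from max(R, t) to infinity.
    tail_part = p_t * lower * (lower / t) ** (-1.0 / gamma) / (1.0 / gamma - 1.0)
    if R >= t:
        return tail_part
    # Exponential part: integral of exp(-lam (u - u0)) from R to t.
    body_part = (math.exp(-lam * (R - u0)) - p_t) / lam
    return body_part + tail_part


premium_2m = exp_par_premium(2_000_000.0)
premium_3m = exp_par_premium(3_000_000.0)
```

Splitting the integral at the threshold keeps both pieces in closed form, so no
numerical integration is needed for either range of $R$.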


Table 6.2 $\hat\Pi(R)$ based on the Exp-Par mixture model.

R            $\hat\Pi(R)$
1,250,000    944,217.8
1,500,000    734,371.6
1,750,000    571,314.1
2,000,000    444,275.5
2,250,000    344,965.2
2,500,000    267,000.7

6.3 Earthquake Data

As a third case study, we analyse the earthquake data introduced in Pisarenko and
Sornette (2003). This data set is extracted from the Harvard catalog and contains
information about the seismic moment (in dyne-cm) of shallow earthquakes (depth <
70 km) over the period from 1977 to 2000. In Pisarenko and Sornette (2003), the
tails of the seismic moment distributions for subduction and midocean ridge zones
are compared by fitting the GP distribution to seismic moment exceedances over
$10^{24}$ dyne-cm.

The exploratory analysis described in Chapter 1 (Figure 1.17) already indicated
an HTE behaviour of the seismic moment distribution for both subduction and midocean
ridge zones. This is further confirmed by the Pareto quantile plots shown
in Figure 6.15. Note, however, that the Pareto quantile plots bend down at the
very largest observations, indicating a weaker behaviour of the ultimate tail of the
seismic moment distribution.

In Figure 6.16(a) and (b), we show the maximum likelihood estimates $\hat\gamma_{ML}$
(solid line) and the Hill estimates $H_{k,n}$ (broken line) as a function of $k$ for
subduction and midocean ridge zones respectively. The subduction zone seismic moment
distribution is clearly heavier-tailed than the midocean ridge distribution, a result
that is consistent with the analysis performed in Pisarenko and Sornette (2003).
Note that $H_{k,n}$ and $\hat\gamma_{ML}$ agree quite well only at the very smallest $k$ values.
Beyond these small $k$ values, both estimates tend to increase as a function of $k$,
albeit at a different rate. The selection of an optimal $k$ value for the Hill estimator
is illustrated in Figure 6.16(c) and (d), where we plot the estimated asymptotic mean
squared errors as a function of $k$. Imposing the restriction that $k$ should be at
least 20, the minimum is reached at $k_{opt} = 1157$ for subduction zones and at $k_{opt} = 58$ for
midocean ridge zones. The vertical reference lines in Figure 6.16 represent these
estimated optimal $k$ values. The use of these estimated optimal $k$ values is further
illustrated on the Pareto quantile plots given in Figure 6.15 by superimposing
the lines through $\left(\log\frac{n_j+1}{k^{(j)}_{opt}+1}, \log x^{(j)}_{n_j-k^{(j)}_{opt},n_j}\right)$ with slopes
$H^{(j)}_{k^{(j)}_{opt},n_j}$, $j = 1, 2$, where $j = 1$ refers to subduction zones and $j = 2$ refers to
midocean ridge zones. The horizontal reference lines in Figure 6.15 represent the
threshold used in Pisarenko and Sornette (2003).

So far, the data for subduction and midocean ridge zones were considered
independently of each other. However, as described in Beirlant and Goegebeur
(2004b), combining data originating from several independent data groups may
result in improved efficiency. Of course, regression models with dummy explanatory
variables describing the groups can be used in combination with classical
extreme value models such as the GEV or GP. This regression approach will be
further developed in Chapter 7. In this section, we concentrate on a straightforward
extension of the exponential regression model for log-spacings of successive order
statistics introduced in Chapter 4.

Consider independent and identically distributed positive random variables
$X^{(j)}_1, \ldots, X^{(j)}_{n_j}$ with a common distribution function $F_{X^{(j)}}$, $j = 1, \ldots, G$, where
$G$ denotes the number of groups. Assume further that the $G$ groups are independent
of each other and that the response distributions are of Pareto-type, that is,
the tail quantile functions $U_{X^{(j)}}$, $j = 1, \ldots, G$, satisfy

$$U_{X^{(j)}}(x) = x^{\gamma_j}\,\ell_j(x),$$

where $\gamma_j$ and $\ell_j$ denote the extreme value index and the slowly varying function
of group $j$ respectively.

As in a classical one-way ANOVA situation, we introduce the parametrization
$\gamma_j = \beta_0 + \beta_j$, $j = 1, \ldots, G$, with $\sum_{j=1}^{G}\beta_j = 0$, so that the parameters $\beta_j$ denote
the difference of the extreme value index of group $j$ with respect to the global
average over all groups. This transformation will now be combined with the following
linear model describing the estimation problem of every $\gamma_j$, $j = 1, \ldots, G$.

Under the second order condition (3.14) on the $\ell_j$, $j = 1, \ldots, G$, it can be
shown as in Beirlant et al. (1999) that the following regression model holds
approximately:

$$i\left(\log X^{(j)}_{n_j-i+1,n_j} - \log X^{(j)}_{n_j-i,n_j}\right) \sim \left(\gamma_j + b_{j,n_j,k}\left(\frac{i}{k+1}\right)^{-\rho_j}\right) F^{(j)}_i, \qquad i = 1, \ldots, k, \qquad (6.8)$$

with $b_j$ and $\rho_j$ denoting the function $b$ and the parameter $\rho$ respectively, of group
$j$, $b_{j,n_j,k} = b_j\left(\frac{n_j+1}{k+1}\right)$, and the $F^{(j)}_i$, $i = 1, \ldots, k$, independent standard
exponential random variables.
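The classical way to estimate the parameters $\gamma_j$, $j = 1, \ldots, G$, is then given
by the Hill (1975) estimates, which are obtained as maximum likelihood estimates
by omitting the terms $b_{j,n_j,k}\left(\frac{i}{k+1}\right)^{-\rho_j}$ in model (6.8) (these terms tend to 0 as
$n_j \to \infty$ and $k/n_j \to 0$), leading to a simple average of the scaled log-spacings
$i\left(\log X^{(j)}_{n_j-i+1,n_j} - \log X^{(j)}_{n_j-i,n_j}\right)$, $i = 1, \ldots, k$, as an estimator of $\gamma_j$, and hence

$$\hat\beta_0 = \frac{1}{G}\sum_{j=1}^{G} H^{(j)}_{k,n_j} \quad \text{and} \quad \hat\beta_j = H^{(j)}_{k,n_j} - \hat\beta_0, \qquad j = 1, \ldots, G, \qquad (6.9)$$

in which $H^{(j)}_{k,n_j}$ denotes the Hill estimator for group $j$:

$$H^{(j)}_{k,n_j} = \frac{1}{k}\sum_{i=1}^{k}\log X^{(j)}_{n_j-i+1,n_j} - \log X^{(j)}_{n_j-k,n_j}. \qquad (6.10)$$

Introducing $\Lambda = \text{Block-diag}(\gamma_j^2 I_k;\ j = 1, \ldots, G)$ and the $kG \times G$ matrix

$$L = \begin{pmatrix} \mathbf{1} & \mathbf{1} & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{1} & \mathbf{0} & \mathbf{1} & \cdots & \mathbf{0}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \mathbf{1} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}\\ \mathbf{1} & -\mathbf{1} & -\mathbf{1} & \cdots & -\mathbf{1} \end{pmatrix},$$

with $\mathbf{1}$ denoting a $k$-vector of ones, we find that the asymptotic covariance matrix
of $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_{G-1})'$ is given by

$$\text{Acov}(\hat\beta) = (L'\Lambda^{-1}L)^{-1}. \qquad (6.11)$$

On the other hand, the main term of the bias of the estimators (when $n_j \to \infty$ and
$k/n_j \to 0$) is given by

$$\text{Abias}(\hat\beta_0) = \frac{1}{G}\sum_{j=1}^{G}\frac{b_{j,n_j,k}}{1 - \rho_j}, \qquad (6.12)$$

$$\text{Abias}(\hat\beta_j) = \frac{b_{j,n_j,k}}{1 - \rho_j} - \frac{1}{G}\sum_{i=1}^{G}\frac{b_{i,n_i,k}}{1 - \rho_i}, \qquad j = 1, \ldots, G-1. \qquad (6.13)$$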

Application of the estimators defined by (6.9) and (6.10) involves the selection
of the number of extreme order statistics $k$ to be used in the estimation. Note
that we take the tail sample fraction $k$ equal for all groups. If $k$ is chosen too small,
the resulting estimators will have a high variance. On the other hand, for larger $k$
values, the estimators will perform quite well with respect to variance but will be
affected by a larger bias, as observations are used that are not really informative
for the tail of $F_{X^{(j)}}$, $j = 1, \ldots, G$. Hence, an appropriate $k$ value should represent
a good bias-variance trade-off. Here, we will use the trace of the asymptotic mean
squared error (AMSE) matrix as optimality criterion.

Defining the AMSE matrix $\Omega$ of $\hat\beta$ as

$$\Omega(k) = (L'\Lambda^{-1}L)^{-1} + \kappa\kappa', \qquad (6.14)$$

with $\kappa$ denoting the $G$-vector containing the asymptotic bias expressions given by
(6.12) and (6.13), the optimal number of extremes to be used in the estimation,
$k_{opt}$, is defined as

$$k_{opt} = \arg\min_k \operatorname{tr}\Omega(k).$$

Note that $\Omega(k)$ depends on the unknown $\gamma_j$, $\rho_j$, $j = 1, \ldots, G$, and $b_{j,n_j,k}$,
$k = 1, \ldots, n_j - 1$, $j = 1, \ldots, G$, which implies that the optimal $k$ has to be derived
from an estimate $\hat\Omega(k)$ of $\Omega(k)$. The following algorithm is used to estimate $k_{opt}$ and
hence $\gamma_j$, $j = 1, \ldots, G$, adaptively:

1. Obtain initial estimates of $\gamma_j$, $\rho_j$, $j = 1, \ldots, G$, together with estimates of
$b_{j,n_j,k}$, $k = 1, \ldots, n_j - 1$, $j = 1, \ldots, G$.

2. For $k = 2, \ldots, \min\{n_j;\ j = 1, \ldots, G\} - 1$:
compute $\operatorname{tr}\hat\Omega(k)$ and let

$$\hat k_{opt} = \arg\min_k \operatorname{tr}\hat\Omega(k).$$

3. Repeat step 2, but with the parameter estimates obtained from using a common
$k$, and obtain an update of the parameter estimates.

Figure 6.18 Pisarenko and Sornette data: $\operatorname{tr}\hat\Omega(k)$ as a function of $\log(k)$.


The initial estimates for the unknown parameters (cf. step 1) are obtained by fitting 
model (6.8) to the k largest observations of each group using a maximum likelihood 
method (see Beirlant et al. (1999)). 

Inference about the regression vector $\beta$ can be drawn using a likelihood
ratio test statistic. For $k/n_j$, $j = 1, \ldots, G$, sufficiently small, the slowly
varying nuisance part of (6.8) can be ignored and hence inference can be based on
the reduced model

$$i\left(\log X^{(j)}_{n_j-i+1,n_j} - \log X^{(j)}_{n_j-i,n_j}\right) \sim \gamma_j F^{(j)}_i, \qquad i = 1, \ldots, k, \quad j = 1, \ldots, G.$$

As in a 'classical' one-way ANOVA situation, the hypothesis of
main interest is $H_0: \beta_1 = \cdots = \beta_{G-1} = 0$.

We now return to the Pisarenko and Sornette (2003) earthquake data. The
procedure described above with $k \ge 20$ yielded $\hat k_{opt} = 97$ with $H^{(1)}_{97,6458} = 1.232$
and $H^{(2)}_{97,1665} = 0.821$. In Figure 6.17, we show the Pareto quantile plots of the
seismic moments for (a) subduction zones and (b) midocean ridge zones, on which
we superimposed the lines through $\left(\log\frac{n_j+1}{\hat k_{opt}+1}, \log x^{(j)}_{n_j-\hat k_{opt},n_j}\right)$ with slope
$H^{(j)}_{\hat k_{opt},n_j}$, $j = 1, 2$ (solid lines). For the hypothesis test of no difference between
the tail heaviness of the seismic moment distribution of subduction and midocean ridge
zones, a likelihood ratio statistic of 7.92 was obtained, resulting in a rejection of
$H_0$. The GP-based approach described in Pisarenko and Sornette (2003) yielded
tail index estimates of 1.51 and 1.02 for subduction and midocean ridge zones
respectively, so our results are slightly more conservative. Likewise, these authors
found significant differences in the tail heaviness of the seismic moment
distributions. As mentioned before, the Pareto quantile plots bend down in the largest
observations. Nevertheless, these largest observations still form more or less a
straight line pattern. So the ultimate tail, too, could be described by a Pareto-type
law. This fact is further illustrated in Figure 6.18, where we plot $\operatorname{tr}\hat\Omega(k)$ as a
function of $\log(k)$. Relaxation of the constraint that $k$ should be at least 20 results
in the global optimum $\hat k_{opt} = 12$ with $H^{(1)}_{12,6458} = 0.541$ and $H^{(2)}_{12,1665} = 0.427$. In
Figure 6.17, the resulting optimal fits are plotted with dotted lines. At this $\hat k_{opt}$,
the null hypothesis of no difference in tail behaviour cannot be rejected on the basis
of the above-described likelihood ratio test statistic.
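
The likelihood ratio test under the reduced model can be sketched as follows. The
text does not spell out the statistic, so the formula below is an assumption: it is
the standard likelihood ratio for equality of the means of $G$ independent
exponential samples of common size $k$, namely
$2k\big(G\log\bar H - \sum_j \log H^{(j)}\big)$ with $\bar H$ the average of the group Hill
estimates. The inputs are the Hill estimates reported above.

```python
import math


def lr_equal_tail_index(hills, k):
    """LR statistic for H0: gamma_1 = ... = gamma_G, common k per group."""
    G = len(hills)
    pooled = sum(hills) / G  # ML estimate of the common gamma under H0
    return 2.0 * k * (G * math.log(pooled) - sum(math.log(h) for h in hills))


def chi2_1_pvalue(stat):
    """Upper tail probability of a chi-square law with 1 degree of freedom.

    Only appropriate for G = 2 groups (G - 1 = 1 degree of freedom).
    """
    return math.erfc(math.sqrt(stat / 2.0))


# Hill estimates reported in the text for subduction and midocean ridge zones.
lr_97 = lr_equal_tail_index([1.232, 0.821], 97)   # constrained optimum, k = 97
lr_12 = lr_equal_tail_index([0.541, 0.427], 12)   # global optimum, k = 12

reject_97 = chi2_1_pvalue(lr_97) < 0.05
reject_12 = chi2_1_pvalue(lr_12) < 0.05
```

With the rounded Hill estimates, `lr_97` comes out close to the reported value of
7.92 and leads to rejection at the 5% level, while `lr_12` does not, in line with the
conclusions above.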



REGRESSION ANALYSIS 


From the discussion in the previous chapters, it became clear that the literature
on the estimation of tail characteristics based on an i.i.d. sample is very
elaborate. However, a major statistical theme is the description of a variable of
primary interest (the dependent variable) in terms of covariates. This regression
point of view has been studied much less extensively in extreme value analysis. Further,
by using covariate information, data sets originating from different sources may
be combined, resulting in opportunities for better point estimates and improved
inference. From an extreme value point of view, interest is mainly in estimating
conditional tail indices, extreme conditional quantiles and small conditional
exceedance probabilities. The available methods, together with their references, can
be grouped into four sets:

• the method of block maxima, fitting the GEV to a sample of maxima, taking
one or more of the GEV parameters as a function of the covariates,

• the quantile view, extending the exponential regression models for
log-spacings of successive order statistics to handle covariate information
(Beirlant and Goegebeur (2003)),

• the probability view, or POT method, where GP distribution-based regression
models are fitted to exceedances over a high threshold (Davison and
Smith (1990)),

• non-parametric estimation procedures resulting from combining modern
smoothing techniques such as maximum penalized likelihood estimation
(Green and Silverman (1994)) and local polynomial maximum likelihood
estimation (Fan and Gijbels (1996)) with models for extreme values (Davison
and Ramesh (2000), Hall and Tajvidi (2000a), Chavez-Demoulin and
Davison (2001), Pauli and Coles (2001), Chavez-Demoulin and Embrechts
(2004), Beirlant and Goegebeur (2004a)).

Before entering into the regression analysis with response distributions in maximal
domains of attraction, we recall some facts from classical regression techniques.

7.1 Introduction 

The aim of regression analysis is to construct mathematical models that describe or
explain relationships that may exist between variables. In general, we are interested
in just one variable, the response or dependent variable, and we want to study how
its distribution depends on a set of variables called the explanatory or independent
variables. We denote the dependent variable by $Y$ and the vector of covariates
by $x$, that is, $x' = (x_1, \ldots, x_d)$. The covariates are assumed non-random. Linear
regression analysis is one of the oldest and most widely used statistical techniques.
The general linear model links the dependent variable to the covariates in an
approximately linear way:

$$Y = \beta'x + \varepsilon,$$

where $\beta$ denotes the vector of regression coefficients, that is, $\beta' = (\beta_1, \ldots, \beta_d)$,
and $\varepsilon$ is the model error with $\varepsilon \sim N(0, \sigma^2)$, or equivalently

$$Y \mid x \sim N(\beta'x, \sigma^2).$$

Note that the response distribution depends on the covariates through its mean. The
general linear model can be extended in different ways. Various non-linear or
non-normal regression models have been studied on an individual basis for many years.
Nelder and Wedderburn (1972) provided a unified and accessible framework
for a class of such models, called generalized linear models (GLMs). Within
this class of generalized linear models, the distribution of the dependent variable
is assumed to belong to the one-parameter exponential family of distributions, with
density function

$$f(y; \theta, \phi) = \exp\left(\frac{\theta y - b(\theta)}{\phi} + c(y, \phi)\right).$$

The parameter $\theta$ is the natural parameter of the exponential family and $\phi$ is a
nuisance or scale parameter. Dependence on the covariates is modelled through
the mean of the dependent variable using a link function $g$:

$$g(E(Y \mid x)) = \beta'x,$$

where the link function $g$ is monotone and differentiable. The general linear model
is a specific member of this family of generalized linear models with a normal
response distribution and identity link function $g(u) = u$.

Note that when dealing with heavy-tailed distributions, the population moments
may not be finite, and hence the above-described techniques cannot be used for
statistical analysis. Further, from an extreme value point of view, the main interest
is in describing conditional tail characteristics such as conditional tail indices,
extreme conditional quantiles and small conditional exceedance probabilities rather
than modelling conditional means. A straightforward approach to tail analysis
in the presence of covariate information consists of modelling one or more of
the parameters of a univariate model $F$, with $F \in \mathcal{D}(G_\gamma)$, as a function of the
covariates. In this, parametrizations can be chosen such that the distribution of
the response variable depends on the covariates through the extreme value index.
Because of its flexibility, the Burr($\eta$, $\tau$, $\lambda$) distribution could, for instance, be used
to model heavy-tailed data when paying special attention to tail behaviour. Note
that for the Burr($\eta$, $\tau$, $\lambda$) distribution $\gamma = 1/(\lambda\tau)$, so in case the main interest is
in describing conditional tails, $\lambda$ and/or $\tau$ may be taken as a function of the
covariates (see Beirlant et al. (1998)). This approach results in fully parametric
statistical models. These global models are fitted to all available data rather
than just to the tail observations and hence do not always provide sufficient
flexibility for accurate tail modelling. Therefore, in the subsequent sections, we will
focus on regression techniques aimed directly at describing tails of conditional
distributions.

7.2 The Method of Block Maxima 

7.2.1 Model description 

From Chapter 2, we know that the only possible limit distributions for a sequence
of normalized maxima are the extreme value distributions. On the basis of this
result, the extreme value index can be estimated by fitting the generalized extreme
value distribution

$$G(y; \sigma, \gamma, \mu) = \begin{cases} \exp\left(-\left(1 + \gamma\,\dfrac{y - \mu}{\sigma}\right)^{-1/\gamma}\right), & 1 + \gamma\,\dfrac{y - \mu}{\sigma} > 0,\ \gamma \ne 0,\\[6pt] \exp\left(-\exp\left(-\dfrac{y - \mu}{\sigma}\right)\right), & y \in \mathbb{R},\ \gamma = 0,\end{cases} \qquad (7.1)$$

with $\mu \in \mathbb{R}$ and $\sigma > 0$ to a sample of maxima. When covariate information is
available, it is natural to extend (7.1) to a regression model by taking one or
more of its parameters as a function of the covariates. We discuss the estimation
problem in its full generality in the sense that the GEV parameters are considered
functions of both the covariate vector and the vector of model parameters, that
is, $\sigma(x) = h_1(x; \beta_1)$, $\gamma(x) = h_2(x; \beta_2)$ and $\mu(x) = h_3(x; \beta_3)$, with $h_1$, $h_2$ and $h_3$
completely specified functions. In the subsequent discussion, the GEV distribution
is referred to as GEV($\sigma$, $\gamma$, $\mu$).

7.2.2 Maximum likelihood estimation

Consider $Y_1, \ldots, Y_m$ independent random variables and let $x_i$ represent the
covariate vector associated with $Y_i$ such that

$$Y_i \sim \text{GEV}(\sigma(x_i), \gamma(x_i), \mu(x_i)), \qquad i = 1, \ldots, m.$$

Denoting by $\beta$ the complete vector of model parameters, that is, $\beta' = (\beta_1', \beta_2', \beta_3')$,
the log-likelihood function is simply

$$\log L(\beta) = \sum_{i=1}^{m} \log g(Y_i; \sigma(x_i), \gamma(x_i), \mu(x_i)), \qquad (7.2)$$

where $g$ is the GEV density function

$$g(y; \sigma, \gamma, \mu) = \begin{cases} \dfrac{1}{\sigma}\left(1 + \gamma\,\dfrac{y - \mu}{\sigma}\right)^{-\frac{1}{\gamma} - 1}\exp\left(-\left(1 + \gamma\,\dfrac{y - \mu}{\sigma}\right)^{-1/\gamma}\right), & 1 + \gamma\,\dfrac{y - \mu}{\sigma} > 0,\ \gamma \ne 0,\\[6pt] \dfrac{1}{\sigma}\exp\left(-\dfrac{y - \mu}{\sigma}\right)\exp\left(-\exp\left(-\dfrac{y - \mu}{\sigma}\right)\right), & y \in \mathbb{R},\ \gamma = 0.\end{cases} \qquad (7.3)$$

The maximum likelihood estimator $\hat\beta$ can be obtained by maximizing (7.2) with
respect to $\beta$. Approximate asymptotic inference follows in the usual way from the
inverse information matrix or the profile likelihood function.
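
To illustrate (7.2) and (7.3), the sketch below evaluates the GEV log-likelihood
for a regression model. The particular parametrization (location linear in a scalar
covariate, log-scale and shape constant) and all names are hypothetical choices made
only for this example; the general setting allows any completely specified functions
$h_1$, $h_2$, $h_3$. Maximization is left to whatever numerical optimizer is preferred.

```python
import math


def gev_logpdf(y, sigma, gamma, mu):
    """Log of the GEV density (7.3); -inf outside the support."""
    if sigma <= 0.0:
        return -math.inf
    z = (y - mu) / sigma
    if abs(gamma) < 1e-12:
        # Gumbel case, gamma = 0.
        return -math.log(sigma) - z - math.exp(-z)
    t = 1.0 + gamma * z
    if t <= 0.0:
        return -math.inf
    return -math.log(sigma) - (1.0 / gamma + 1.0) * math.log(t) - t ** (-1.0 / gamma)


def gev_loglik(beta, ys, xs):
    """Log-likelihood (7.2) for a hypothetical parametrization.

    beta = (b0, b1, log_sigma, gamma): mu(x) = b0 + b1 * x,
    sigma(x) = exp(log_sigma), gamma(x) = gamma.
    """
    b0, b1, log_sigma, gamma = beta
    sigma = math.exp(log_sigma)
    return sum(gev_logpdf(y, sigma, gamma, b0 + b1 * x) for y, x in zip(ys, xs))


# Made-up block maxima and covariate values (illustration only).
ys = [2.1, 2.9, 3.8, 4.4, 6.0]
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
value = gev_loglik((2.0, 1.5, 0.0, 0.1), ys, xs)
```

Parametrizing the scale through its logarithm keeps $\sigma > 0$ automatically, and
returning $-\infty$ outside the support lets a generic optimizer reject infeasible
parameter values.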

If the data are not exactly GEV distributed but instead we have a sample
of maxima at our disposal, then, following condition $(C_\gamma)$, the GEV can still be
used as an approximation to the true distribution of the maximum. The above-described
maximum likelihood procedure is then applied to the sample of maxima. Recall
that, in this case, the parameters $\sigma$ and $\mu$ absorb the normalizing constants $a_n$
and $b_n$, respectively, in the derivation of the limit laws for maxima given in section 2.1.
Hence, the parametrization for $\sigma$ and $\mu$ follows immediately from these. In practice,
we usually have no knowledge about $F$ and $U$, and setting up an appropriate
GEV parametrization in terms of $\mathbf{x}$ is often difficult. Simulation results, however,
indicate that incorrect specifications may lead to unreliable point estimates. One
possible solution for this problem is to consider a broad class of distribution functions
satisfying $(C_\gamma)$, followed by the determination of the appropriate $\mu$ and $\sigma$
parametrizations and the resulting limiting form. For instance, ignoring the dependence
on the covariates for notational convenience, in case of the Hall class (Hall
(1982)) for which


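$$
U(y)=
\begin{cases}
C y^{\gamma}\left(1+D y^{\rho}(1+o(1))\right), & \gamma>0,\\[4pt]
x_{+}-C y^{\gamma}\left(1+D y^{\rho}(1+o(1))\right), & \gamma<0,
\end{cases}
$$

with $C > 0$, $D \in \mathbb{R}$ and $\rho < 0$, we can take, for $n \to \infty$,

$$
b_n = U(n) \sim
\begin{cases}
C n^{\gamma}, & \gamma>0,\\[4pt]
x_{+}-C n^{\gamma}, & \gamma<0,
\end{cases}
\qquad
a_n \sim |\gamma|\,C n^{\gamma},
$$

and hence, setting $z = b_n + a_n y$,

$$
P(Y_{n,n}\le z)\sim
\begin{cases}
\exp\left(-\left(\dfrac{z}{C n^{\gamma}}\right)^{-1/\gamma}\right), & \gamma>0,\\[8pt]
\exp\left(-\left(\dfrac{x_{+}-z}{C n^{\gamma}}\right)^{-1/\gamma}\right), & \gamma<0.
\end{cases}
\tag{7.4}
$$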
Clearly, for this very broad class of distribution functions, the appropriate $\mu$ and $\sigma$
parametrizations can be easily obtained. Of course, application of (7.4) still requires
specifying $\gamma$ and possibly $C$ in terms of the covariates. Note also that since $\mu$ and
$\sigma$ depend on $n$, model (7.4) can accommodate data sets with different subsample
sizes.

Example 7.1 The Ca against pH scatterplot shown in Figure 1.19(a) gives an indication
that the tail of the conditional Ca distribution may depend on the pH level
of soil samples. We analyse these data using the GEV regression approach discussed
above. The Pareto quantile plots of the Ca measurements at some fixed
pH levels given in Figure 6.7 indicate that Pareto-type models provide appropriate
fits to the conditional distributions of Ca given pH. Further, following Goegebeur
et al. (2004), we model the extreme value index $\gamma$ in terms of the pH level using
an exponential link function. The data set is preprocessed in the sense that observations
identified as suspect or incorrect are excluded from the analysis. At each
pH level, we compute the maximum Ca value and the number of available Ca
observations, see Figure 7.1(a) and (b) respectively. The GEV-Hall model (7.4)
with $\gamma(pH) = \exp(\beta_0 + \beta_1 pH)$ is fitted to these 30 subsample maxima using the
maximum likelihood method. This results in the point estimates

$\hat{C} = 96.4173$,

$\hat{\beta}_0 = -2.8406$,

$\hat{\beta}_1 = 0.2908$.

The profile likelihood function and the profile likelihood-based 95% confidence
interval for $\beta_1$ are shown in Figure 7.1(c). The confidence interval for $\beta_1$ does not
include the value 0, so at a significance level of 5%, the hypothesis $H_0: \beta_1 = 0$
can be rejected. Hence, the tail heaviness of the conditional Ca distribution varies
significantly with the pH level of soil samples.

7.2.3 Goodness-of-fit 

Having fitted a model to a data set, for instance, using maximum likelihood methods,
one should evaluate how well the model describes or explains the available
data. This is especially true here since the complicated model was based on the
asymptotics of maxima. When dealing with regression models, the goodness-of-fit
is typically assessed visually by inspecting various kinds of residual plots. In
the present context, the classical residuals $Y_i - \hat{\mu}_i$, where $\hat{\mu}_i$ denotes some measure
of location, are not very useful as, in general, these are not identically distributed.
Therefore, we will look for other quantities or random variables that satisfy the
i.i.d. property, and hence can be considered as generalized residuals. Consider

$$
Y_i \sim \mathrm{GEV}(\sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i),\mu(\mathbf{x}_i)),\qquad i=1,\ldots,m.
$$


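The transformation

$$
R_i=
\begin{cases}
\dfrac{1}{\gamma(\mathbf{x}_i)}\log\left(1+\gamma(\mathbf{x}_i)\,\dfrac{Y_i-\mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}\right), & \gamma(\mathbf{x}_i)\neq 0,\\[8pt]
\dfrac{Y_i-\mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}, & \gamma(\mathbf{x}_i)=0,
\end{cases}
\tag{7.5}
$$

results in a standard Gumbel distributed random variable $R_i$ (Coles (2001)):

$$
\begin{aligned}
F_{R_i|\mathbf{x}_i}(r_i)
&=
\begin{cases}
G\left(\mu(\mathbf{x}_i)+\dfrac{\sigma(\mathbf{x}_i)}{\gamma(\mathbf{x}_i)}\left[\exp(\gamma(\mathbf{x}_i)r_i)-1\right];\ \sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i),\mu(\mathbf{x}_i)\right), & \gamma(\mathbf{x}_i)\neq 0,\\[8pt]
G\left(\mu(\mathbf{x}_i)+\sigma(\mathbf{x}_i)r_i;\ \sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i),\mu(\mathbf{x}_i)\right), & \gamma(\mathbf{x}_i)=0,
\end{cases}\\
&=\exp(-\exp(-r_i)).
\end{aligned}
\tag{7.6}
$$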


The resulting Gumbel distribution no longer depends on the covariates,
and hence the random variables $R_1,\ldots,R_m$ are identically distributed.
If $Y_1,\ldots,Y_m$ are assumed independent, then $R_1,\ldots,R_m$ are also independent.
Hence, analogously to the case of classical linear regression, $R_1,\ldots,R_m$ can be
used to construct several kinds of residual plots. Here, we will concentrate on
residual quantile plots. The quantile function associated with (7.6) is given by

$$
Q(p) = -\log(-\log p),\qquad 0 < p < 1,
$$

yielding the Gumbel quantile plot coordinates

$$
\left(-\log\left(-\log\frac{i}{m+1}\right),\ R_{i,m}\right),\qquad i=1,\ldots,m,
$$

where $R_{1,m}\le\cdots\le R_{m,m}$ denote the ordered residuals. In case (7.6) provides an accurate description of the data, we expect the points on
the Gumbel quantile plot to be close to the first diagonal.
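The residual transformation (7.5) and the coordinates of the Gumbel quantile plot can be sketched in a few lines of Python. The function names below are hypothetical, and the fitted values of $\sigma(\mathbf{x}_i)$, $\gamma(\mathbf{x}_i)$ and $\mu(\mathbf{x}_i)$ are assumed to be supplied by a previously fitted model; each observation is assumed to lie inside the support of its fitted GEV distribution.

```python
import math

def gumbel_residual(y, sigma, gamma, mu):
    """Generalized residual (7.5); requires 1 + gamma * (y - mu) / sigma > 0."""
    z = (y - mu) / sigma
    if abs(gamma) < 1e-8:                 # gamma = 0: plain standardization
        return z
    return math.log(1.0 + gamma * z) / gamma

def gumbel_qq_coordinates(residuals):
    """Pairs (-log(-log(i / (m + 1))), R_{i,m}) for the ordered residuals."""
    m = len(residuals)
    ordered = sorted(residuals)
    return [(-math.log(-math.log(i / (m + 1))), r)
            for i, r in enumerate(ordered, start=1)]
```

Plotting the second coordinate against the first and comparing with the line through the origin with unit slope gives the visual check described above.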

Alternatively, as discussed in Chapter 1, we can always return to the exponential
framework by transforming the data first to the exponential case followed by a
subsequent assessment of the exponential quantile fit. To do so, note that

$$
G(Y;\sigma(\mathbf{x}),\gamma(\mathbf{x}),\mu(\mathbf{x})) = U,
$$

where $U \sim U(0, 1)$ and hence

$$
-\log\left(1 - G(Y;\sigma(\mathbf{x}),\gamma(\mathbf{x}),\mu(\mathbf{x}))\right) = E,
\tag{7.7}
$$

with $E \sim \mathrm{Exp}(1)$. On the basis of this, the fit of the GEV regression model can
be assessed by constructing the plot

$$
\left(-\log\left(1-\frac{i}{m+1}\right),\ -\log(1-U_{i,m})\right),\qquad i=1,\ldots,m,
$$

where $U_i = G(Y_i;\sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i),\mu(\mathbf{x}_i))$ and $U_{1,m}\le\cdots\le U_{m,m}$ are the corresponding
order statistics, and to inspect the closeness of the points to the first diagonal.


Example 7.1 (continued) We now evaluate how well the GEV-Hall model (7.4)
with $\gamma(pH) = \exp(\beta_0 + \beta_1 pH)$ describes the conditional Ca-maxima using the
above-introduced quantile plots. In Figure 7.2(a), we show the Gumbel quantile
plot of the generalized residuals (7.5). Taking the small sample size and the high
variability of the subsample sizes over the pH range into account, we can conclude
that the GEV-Hall regression model describes the conditional Ca-maxima
distribution quite well. Alternatively, we can evaluate the fit on the basis of the
exponential quantile plot of the generalized residuals (7.7), see Figure 7.2(b).


7.2.4 Estimation of extreme conditional quantiles 

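Extreme conditional quantile estimates can be obtained by inverting the conditional
GEV distribution function yielding

$$
Q(1-p;\mathbf{x})=
\begin{cases}
\mu(\mathbf{x})+\dfrac{\sigma(\mathbf{x})}{\gamma(\mathbf{x})}\left[\left(-\log(1-p)\right)^{-\gamma(\mathbf{x})}-1\right], & \gamma(\mathbf{x})\neq 0,\\[8pt]
\mu(\mathbf{x})-\sigma(\mathbf{x})\log\left(-\log(1-p)\right), & \gamma(\mathbf{x})=0,
\end{cases}
\tag{7.8}
$$

and replacing the unknown parameters by their respective estimates.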
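The inversion of the GEV distribution function can be checked numerically. The sketch below, with hypothetical function names, returns the quantile that is exceeded with probability $p$; the optional argument `n` is an assumption added for illustration and covers the situation where the GEV approximates the distribution of the maximum of $n$ observations, in which case the quantile of the underlying observations solves $G = (1-p)^n$.

```python
import math

def gev_cdf(y, sigma, gamma, mu):
    """GEV distribution function (7.1)."""
    z = (y - mu) / sigma
    if abs(gamma) < 1e-8:
        return math.exp(-math.exp(-z))
    t = 1.0 + gamma * z
    if t <= 0:                         # outside the support
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-t ** (-1.0 / gamma))

def gev_quantile(p, sigma, gamma, mu, n=1):
    """Solve G(q) = (1 - p)**n for q; n = 1 gives the quantile of the maxima."""
    a = -n * math.log(1.0 - p)         # equals -log((1 - p)**n)
    if abs(gamma) < 1e-8:
        return mu - sigma * math.log(a)
    return mu + sigma / gamma * (a ** (-gamma) - 1.0)
```

Evaluating `gev_cdf` at the returned quantile should reproduce $(1-p)^n$ up to rounding error, which is a convenient sanity check after plugging in estimated parameters.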

Example 7.1 (continued) We continue with the analysis of the conditional Ca-maxima.
The estimated conditional 0.95 quantile of the Ca-maxima distribution is
given as a function of pH in Figure 7.3. Note that one observation exceeds the
estimated 0.95 quantile (we expect one or two observations out of 30 above the
conditional 0.95 quantile).

Figure 7.3 Condroz data: conditional Ca-maxima (circles) and estimated conditional
0.95 quantiles (squares), plotted against pH.

Note that in case the GEV regression model is used to approximate the true
conditional distribution of the largest value in a sample, (7.8) yields the quantiles
of the conditional maximum distribution. The quantiles of the original data can be
obtained from

$$
\begin{cases}
\mu(\mathbf{x})+\dfrac{\sigma(\mathbf{x})}{\gamma(\mathbf{x})}\left[\left(-\log(1-p)^{n}\right)^{-\gamma(\mathbf{x})}-1\right], & \gamma(\mathbf{x})\neq 0,\\[8pt]
\mu(\mathbf{x})-\sigma(\mathbf{x})\log\left(-\log(1-p)^{n}\right), & \gamma(\mathbf{x})=0.
\end{cases}
$$

7.3 The Quantile View — Methods Based on 
Exponential Regression Models 
7.3.1 Model description 

In this section, we extend the exponential regression model for log-spacings of 
successive order statistics introduced in Chapter 4 to the regression case. This 
approach has only been worked out in case of Pareto-type response distributions. 

Consider $Y_1,\ldots,Y_n$ i.i.d. random variables according to distribution function
$F$, where $F$ is of Pareto-type, that is, the tail quantile function $U$ satisfies

$$
U(y) = y^{\gamma}\,\ell_U(y),\qquad y > 1;\ \gamma > 0.
\tag{7.9}
$$

From section 4.4, we know that, when $\log \ell_U$ satisfies $C_\rho(b)$ for some $\rho < 0$
and $b \in \mathcal{R}_\rho$, log-spacings of successive order statistics can be approximately
represented as

$$
j\left(\log Y_{n-j+1,n}-\log Y_{n-j,n}\right)\sim\left(\gamma+b_{n,k}\left(\frac{j}{k+1}\right)^{-\rho}\right)E_j,\qquad j=1,\ldots,k,
\tag{7.10}
$$

where $E_1,\ldots,E_k$ are independent standard exponential random variables and
$b_{n,k} = b\left(\frac{n+1}{k+1}\right)$. When covariate information is available, (7.9) can be extended to a
regression model by modelling $\gamma$ and possibly $\ell_U$ as a function of the covariates. In
this case, the exponential regression model (7.10) cannot be directly applied to the
response observations as these are not identically distributed. One possible solution
is to transform the response observations into generalized residuals, thereby removing
(at least partly) the dependence on the covariates. These generalized residuals
then form the basis for applying (7.10).


7.3.2 Maximum likelihood estimation 

Consider independent random variables $Y_1,\ldots,Y_n$ with respective associated
covariate vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$ such that the conditional distribution of $Y$ given $\mathbf{x}$
is of Pareto-type, that is, for some $\gamma(\mathbf{x}) > 0$

$$
1 - F_{Y|\mathbf{x}}(y) = y^{-1/\gamma(\mathbf{x})}\,\ell_F(y;\mathbf{x}).
\tag{7.11}
$$

As above, we set $\gamma(\mathbf{x}) = h(\mathbf{x};\boldsymbol{\beta})$, for some completely specified function $h$ and
with $\boldsymbol{\beta}$ denoting a vector of regression coefficients. Note that aside from the extreme
value index $\gamma$, $F_{Y|\mathbf{x}}$ may also depend on the covariates through $\ell_F$. Since $Y_1,\ldots,Y_n$
are not identically distributed, (7.10) cannot be directly applied to the raw input
data. However, by transforming the dependent variables, the dependence on the
covariates may be at least partly removed. The transformation

$$
R = Y^{1/\gamma(\mathbf{x})}
\tag{7.12}
$$

standardizes the extreme value index:

$$
1 - F_{R|\mathbf{x}}(r) = r^{-1}\,\ell_F(r^{\gamma(\mathbf{x})};\mathbf{x}).
$$

Next, we restrict the class of distribution functions satisfying (7.11) to the distributions
for which

$$
\ell_F(r^{\gamma(\mathbf{x})};\mathbf{x}) = \tilde{\ell}_F(r),
\tag{7.13}
$$

or equivalently, to the class of conditional distribution functions for which transformation
(7.12) removes the conditioning on $\mathbf{x}$ completely. The random variables
$R_1,\ldots,R_n$ obtained by applying (7.12) to $Y_1,\ldots,Y_n$ are now clearly independent
(since the $Y_i$ are independent) and identically distributed ($\gamma$ and $\tilde{\ell}_F$ no longer
depend on $\mathbf{x}$). The $R_i$, $i = 1,\ldots,n$, again form the basis of the statistical analysis.
We denote the order statistics associated with $R_1,\ldots,R_n$ by $R_{1,n}\le\cdots\le R_{n,n}$.

In case $\log\tilde{\ell}_U$ satisfies $C_\rho(b)$ for some $\rho < 0$ and $b \in \mathcal{R}_\rho$, using derivations
similar to the ones in section 4.4, the following approximate representation for
log-spacings of generalized residuals can be proposed

$$
Z_j\sim\left(1+b_{n,k}\left(\frac{j}{k+1}\right)^{-\rho}\right)E_j,\qquad j=1,\ldots,k,
\tag{7.14}
$$

with $Z_j = j(\log R_{n-j+1,n} - \log R_{n-j,n})$, $b_{n,k} = b\left(\frac{n+1}{k+1}\right)$ and $E_1,\ldots,E_k$ denoting
independent standard exponential random variables. The regression coefficients can
be estimated jointly with $b_{n,k}$ and $\rho$ using the maximum likelihood method.

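The log-likelihood function for $Z_1,\ldots,Z_k$ is given by

$$
\log L(\boldsymbol{\beta}, b_{n,k}, \rho) = -\sum_{j=1}^{k}\log\left(1+b_{n,k}\left(\frac{j}{k+1}\right)^{-\rho}\right)-\sum_{j=1}^{k}\frac{Z_j}{1+b_{n,k}\left(\frac{j}{k+1}\right)^{-\rho}}.
\tag{7.15}
$$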

Note that the likelihood function depends on the regression coefficients through
the ordered residuals and hence is more complicated than in section 4.4. For computational
details concerning the numerical maximization of (7.15), we refer to
Beirlant and Goegebeur (2003). Inference about the regression coefficients can be
drawn using the profile log-likelihood ratio test statistic given by $2(\log L_p(\hat{\boldsymbol{\beta}}^{(0)}) - \log L_p(\boldsymbol{\beta}^{(0)}_*))$
with $\log L_p(\boldsymbol{\beta}^{(0)})$ denoting the profile log-likelihood function of some
subset $\boldsymbol{\beta}^{(0)}$ of $\boldsymbol{\beta}$. This statistic equals the classical likelihood ratio statistic for testing
the hypothesis $H_0: \boldsymbol{\beta}^{(0)} = \boldsymbol{\beta}^{(0)}_*$. As discussed in Beirlant and Goegebeur (2003),
the classical $\chi^2$ approximation to the null distribution of the test statistic is inappropriate.
We therefore propose to simulate the reference distribution by using
a parametric bootstrap procedure (Efron and Tibshirani (1993)). Bootstrap samples
are generated from a strict Pareto distribution with parameters $\boldsymbol{\beta}^{(0)}_*$ and the
maximum likelihood estimates of the remaining regression coefficients given $\boldsymbol{\beta}^{(0)}_*$.

Example 7.2 We illustrate the proposed procedure with the diamond data introduced
in section 1.3.5. In a first attempt, trying to fit regression models over the
whole range of the variable size, the application of model (7.11) with $Y = value$
and $\gamma(size) = \exp(\beta_0 + \beta_1 size)$ does not provide an appropriate fit: the Pareto
quantile plot of the residuals $R = Y^{\exp(-\beta_1 size)}$ becomes horizontal for the largest
observations, that is, $\exp(\beta_0) = 0$ (cf. infra). Rather, the extreme value index is
found to vary polynomially with size. The scatterplot of value versus $\log(size)$ is
given in Figure 7.4(a). In Figure 7.4(b), we show the profile log-likelihood function
of $\beta_1$ for $k$ = 200, 250, 300, 350, 400. These profile likelihood functions clearly
indicate a $\beta_1$ estimate of approximately 0.3. Finally, Figure 7.4(c) and (d) contain
the maximum likelihood estimates as a function of $k$.

Alternatively, following (7.14), the quantities

$$
\frac{Z_j}{1+b_{n,k}\left(\frac{j}{k+1}\right)^{-\rho}},\qquad j=1,\ldots,k,
$$

are approximately distributed as an i.i.d. sample from the standard exponential
distribution and can be used to construct an exponential quantile plot.


Example 7.2 (continued) Figure 7.5(a) shows the Pareto quantile plot of the generalized
residuals $\log r_i = size_i^{-\hat{\beta}_1}\log value_i$, $i = 1,\ldots,1914$. The residuals are
computed using the $\beta_1$ estimate at $k = 200$. The Pareto quantile plot is clearly linear
in the largest observations, indicating a good fit of a Pareto-type model to the
residual distribution. In a similar way, we also constructed a Pareto quantile plot
of the residuals $\log r_i = \exp(-\hat{\beta}_1 size_i)\log value_i$, see Figure 7.5(b). The ultimate
horizontal appearance indicates that the residual distribution cannot be adequately
described by a Pareto-type model.

7.3.4 Estimation of extreme conditional quantiles 

In case of an i.i.d. sample from a Pareto-type distribution, $Y_1,\ldots,Y_n$, extreme
quantiles can be estimated by extrapolation along a line through the anchor point
$\left(\log\frac{n+1}{k+1}, \log Y_{n-k,n}\right)$ with slope $\hat{\gamma}_k$ on the Pareto quantile plot, resulting in the
estimator (see e.g. Weissman (1978))

$$
\hat{Q}_{Y,k}(p) = Y_{n-k,n}\left(\frac{k+1}{(n+1)p}\right)^{\hat{\gamma}_k},\qquad k=1,\ldots,n-1,
\tag{7.16}
$$

where $\hat{\gamma}_k$ denotes an estimator for $\gamma$ based on the $k$ largest order statistics. When
covariate information is available, (7.16) cannot be applied directly to the raw data
since the observations are no longer identically distributed. In this situation, the
observations will first be transformed to i.i.d. data using (7.12). Next, (7.16) will
be used in the extrapolation step, yielding an estimator of an extreme quantile of
the residual distribution. Finally, the quantile estimator of the generalized residuals
will be transformed back to the original observations by inverting (7.12). This
results in the following estimator for the $(1-p)$-th quantile of $F_{Y|\mathbf{x}}$:

$$
\hat{Q}_{Y,k}(p;\mathbf{x}) = \left(\hat{R}_{n-k,n}\,\frac{k+1}{(n+1)p}\right)^{\hat{\gamma}_k(\mathbf{x})},\qquad k=d+1,\ldots,n-1,
$$

with $\hat{\gamma}_k(\mathbf{x})$ denoting the estimator for $\gamma(\mathbf{x})$ obtained by using the $k$ largest order
statistics and $\hat{R}_{n-k,n}$ representing the $(k+1)$-th largest order statistic of the generalized
residuals obtained by using $\hat{\gamma}_k(\mathbf{x})$ in (7.12).
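The three steps (transform to residuals, extrapolate, transform back) translate directly into code. The following Python sketch uses hypothetical names and assumes that the fitted extreme value indices $\hat{\gamma}_k(\mathbf{x}_i)$ of the observations and the fitted index at the covariate value of interest are already available from a previous estimation step; the responses are assumed to be positive.

```python
def conditional_weissman_quantile(ys, gammas, gamma_at_x, k, p):
    """Estimate the conditional quantile exceeded with probability p.

    ys         : positive responses Y_1, ..., Y_n
    gammas     : fitted gamma(x_i) for each observation (assumed given)
    gamma_at_x : fitted gamma at the covariate value of interest
    k          : number of upper order statistics, 1 <= k <= n - 1
    """
    n = len(ys)
    # Step 1: generalized residuals R_i = Y_i ** (1 / gamma(x_i)), see (7.12).
    residuals = sorted(y ** (1.0 / g) for y, g in zip(ys, gammas))
    # R_{n-k,n} is the (k+1)-th largest residual.
    anchor = residuals[n - k - 1]
    # Step 2: Weissman-type extrapolation (7.16) with extreme value index 1.
    residual_quantile = anchor * (k + 1) / ((n + 1) * p)
    # Step 3: invert (7.12) to return to the scale of the responses.
    return residual_quantile ** gamma_at_x
```

When all fitted indices equal one, the function reduces to (7.16) with $\hat{\gamma}_k = 1$ applied to the raw data, which gives a quick consistency check.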


Example 7.2 (continued) In Figure 7.6, we show the value versus $\log(size)$ scatterplot
with the estimated conditional 0.99 quantile obtained at $k = 102$ superimposed.
The $k$ value used to compute the 0.99 quantiles is selected so as to
minimize

$$
\left|\frac{1}{n}\sum_{i=1}^{n} I\left(value_i > \hat{Q}_{Y,k}(0.01;\log(size_i))\right) - 0.01\right|
$$

with respect to $k$.

Figure 7.5 Diamond data: (a) Pareto quantile plot of the generalized residuals for
the regression model with $\log(size)$ as explanatory variable, (b) Pareto quantile
plot of the generalized residuals for the regression model with size as explanatory
variable (in (a) and (b), the generalized residuals are computed using the $\beta_1$
estimates at $k = 200$).

Figure 7.6 Diamond data: scatterplot of value versus $\log(size)$ with
$\hat{Q}_{Y,102}(0.01;\log(size))$ superimposed.


7.4 The Tail Probability View - Peaks Over
Threshold (POT) Method

7.4.1 Model description 

Consider a random variable $Z$ with distribution function $F$ satisfying $F \in \mathcal{D}(G_\gamma)$.
Following (C*), the conditional distribution of $Y = Z - u$ given $Z > u$ can be
well approximated by the GP distribution, at least for threshold values that are
sufficiently large. On the basis of this, it is natural to model exceedances over a
specified high threshold by the GP distribution with distribution function

$$
H(y;\sigma,\gamma)=
\begin{cases}
1-\left(1+\gamma\,\dfrac{y}{\sigma}\right)^{-1/\gamma}, & 1+\gamma\,\dfrac{y}{\sigma}>0,\ \gamma\neq 0,\\[8pt]
1-\exp\left(-\dfrac{y}{\sigma}\right), & y>0,\ \gamma=0,
\end{cases}
$$

where $\sigma > 0$ is the scale parameter of the GP family. Similar to the approach
followed with the GEV, we extend the GP distribution to a regression model
by taking $\sigma$ and/or $\gamma$ as a function of the covariate vector and the vectors of
regression coefficients, that is, $\sigma(\mathbf{x}) = h_1(\mathbf{x};\boldsymbol{\beta}_1)$ and $\gamma(\mathbf{x}) = h_2(\mathbf{x};\boldsymbol{\beta}_2)$, see, for
instance, Davison and Smith (1990).

7.4.2 Maximum likelihood estimation 

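Let $Y_1,\ldots,Y_n$ be independent random variables and let $\mathbf{x}_i$ denote the covariate
vector associated with $Y_i$ such that

$$
Y_i \sim \mathrm{GP}(\sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i)),\qquad i=1,\ldots,n.
$$

Again we denote the complete parameter vector with $\boldsymbol{\beta}$, so $\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2')$. The
log-likelihood function is then

$$
\log L(\boldsymbol{\beta}) = \sum_{i=1}^{n}\log h(Y_i;\sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i)),
\tag{7.17}
$$

where $h$ is the GP density function:

$$
h(y;\sigma,\gamma)=
\begin{cases}
\dfrac{1}{\sigma}\left(1+\gamma\,\dfrac{y}{\sigma}\right)^{-\frac{1}{\gamma}-1}, & 1+\gamma\,\dfrac{y}{\sigma}>0,\ \gamma\neq 0,\\[8pt]
\dfrac{1}{\sigma}\exp\left(-\dfrac{y}{\sigma}\right), & y>0,\ \gamma=0.
\end{cases}
$$

The maximum likelihood estimator can be obtained by maximizing (7.17) with
respect to $\boldsymbol{\beta}$. Approximate inference about the regression coefficients can be drawn
on the basis of the limiting normal distribution of the maximum likelihood estimator
or using the profile likelihood approach. The profile log-likelihood function is
usually not quadratic in small and moderate samples and provides a better basis
for confidence intervals than the observed or expected information (see Davison and
Smith (1990)).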

We now turn to the case where the data are not exactly GP distributed. Consider
a random variable $Z$ with associated covariate vector $\mathbf{x}$ such that the conditional
distribution of $Z$ given $\mathbf{x}$, $F_{Z|\mathbf{x}}$, is in the max-domain of attraction of the GEV,
$F_{Z|\mathbf{x}} \in \mathcal{D}(G_{\gamma(\mathbf{x})})$. Here, the notation $\gamma(\mathbf{x})$ stresses the possible dependence of the
tail index on the covariates. On the basis of (C*), the GP distribution can be used to
approximate the conditional distribution of $Z - u_{\mathbf{x}}$ given $Z > u_{\mathbf{x}}$ where $u_{\mathbf{x}}$ denotes
a sufficiently high threshold. Given independent random variables $Z_1,\ldots,Z_n$ and
associated covariate vectors $\mathbf{x}_1,\ldots,\mathbf{x}_n$, the above-described maximum likelihood
procedure is then applied to the exceedances $Y_j = Z_i - u_{\mathbf{x}_i}$, provided $Z_i > u_{\mathbf{x}_i}$,
$j = 1,\ldots,N_{u_{\mathbf{x}}}$, where $i$ is the index of the $j$-th exceedance in the original sample
and $N_{u_{\mathbf{x}}}$ denotes the number of exceedances over the threshold 'function' $u_{\mathbf{x}}$. Of
course, the covariate vectors need to be re-indexed in an analogous way. Similar
to the i.i.d. case, applying the GP approach involves the selection of an appropriate
threshold $u_{\mathbf{x}}$. In the regression case, the specification of a threshold gets even more
difficult since, in principle, the threshold can depend on the covariates in order to
take the relative extremity of the observations into account (see also Davison and
Smith (1990) and Coles (2001)). Up to now, solutions seem to be rather ad hoc
and dependent on the data set at hand. One possible route to a scientifically better-founded
approach is the following. Let $\gamma(\mathbf{x})$ and $\sigma(\mathbf{x})$ be as defined above
and let $u_{\mathbf{x}} = u(\mathbf{x};\boldsymbol{\theta})$ denote the threshold function, depending on both the covariates
and a vector of regression coefficients $\boldsymbol{\theta}$, where $\boldsymbol{\theta}' = (\theta_1,\ldots,\theta_d)$. The following
mixed-integer programming formulation allows estimation of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$ and $\boldsymbol{\theta}$ for the
GP regression model such that exactly $k$ observations fall above $u_{\mathbf{x}}$:

( 1 + 咖)贷)卜-心） 


— l ， ... ， ^z, 

with $M$ a big number. From a computational point of view, however, this approach
is very difficult. Alternatively, the Koenker and Bassett (1978) quantile regression
methodology may be used to obtain a covariate dependent threshold. Suppose the
conditional quantile function $Q(p;\mathbf{x})$ associated with $F_{Z|\mathbf{x}}$ can be modelled by
a completely specified function $u(\mathbf{x};\boldsymbol{\theta}_p)$, that is, $Q(p;\mathbf{x}) = u(\mathbf{x};\boldsymbol{\theta}_p)$. The $p$-th
($0 < p < 1$) quantile regression estimator $\hat{\boldsymbol{\theta}}_p$ of $\boldsymbol{\theta}_p$ is then defined as a solution to
the following optimization problem

$$
\min_{\boldsymbol{\theta}_p}\sum_{i=1}^{n}\left(p\,(Z_i-u(\mathbf{x}_i;\boldsymbol{\theta}_p))^{+}+(1-p)\,(Z_i-u(\mathbf{x}_i;\boldsymbol{\theta}_p))^{-}\right),
$$

with $x^{+} = \max(0, x)$ and $x^{-} = \max(0, -x)$. When working with the GP distribution,
the threshold can be set at a particular regression quantile, that is,
$u_{\mathbf{x}} = u(\mathbf{x};\hat{\boldsymbol{\theta}}_p)$. The estimated conditional quantile function is then used to compute
exceedances that are in turn plugged into the maximum likelihood estimation.
This procedure may be performed for $p = (n-k)/(n+1)$, $k = d+1,\ldots,n-1$, and, similar
to the i.i.d. case, the point estimates plotted as a function of $k$.
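The quantile regression objective and the subsequent computation of exceedances can be sketched as follows. This is only an illustrative Python sketch: the function names are hypothetical, the threshold function `u(x, theta)` is passed in by the user, and no particular optimizer is implied (in practice the objective would be minimized by linear programming or a general numerical routine).

```python
def check_loss(residual, p):
    """Koenker-Bassett loss p * r^+ + (1 - p) * r^- with r^- = max(0, -r)."""
    return p * max(0.0, residual) + (1.0 - p) * max(0.0, -residual)

def quantile_objective(theta, zs, xs, p, u):
    """Objective whose minimizer over theta is the p-th regression quantile."""
    return sum(check_loss(z - u(x, theta), p) for z, x in zip(zs, xs))

def exceedances(zs, xs, theta, u):
    """Pairs (Z_i - u(x_i; theta), x_i) for observations above the threshold."""
    return [(z - u(x, theta), x) for z, x in zip(zs, xs) if z > u(x, theta)]

def constant_threshold(x, theta):
    """Hypothetical threshold function that ignores the covariate."""
    return theta[0]
```

With the constant threshold function, minimizing the objective over the observed values recovers an empirical quantile of the responses, which illustrates why the minimizer is called a regression quantile.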


max 

^ 1 ^ 2 , 


-loga(x/) - 


、yM 


+ 1 log 


subject to 


Pu p 2 , e G 

Z/ M8i > u Xi , 

_ D = 众， 
^ G { 0 , 1 }, 


Example 7.1 (continued) We illustrate the GP regression modelling of conditional
exceedances with the Condroz data. A GP regression model with $\gamma(pH) = \exp(\beta_0 + \beta_1 pH)$
is fitted to the Ca exceedances over both a constant threshold and
a covariate dependent threshold. The constant threshold is taken as the $(k+1)$-th
largest observation on the dependent variable, $k = 5,\ldots,n-1$. This is illustrated
in Figure 7.7(a) for $k = 20$. Alternatively, we used the Koenker and Bassett (1978)
quantile regression methodology to obtain a covariate dependent threshold. Here
we set

$$
u_{pH} = \exp(\hat{\theta}_{0,p} + \hat{\theta}_{1,p}\,pH),
$$

where $\hat{\theta}_{0,p}$ and $\hat{\theta}_{1,p}$ denote the $p$-th regression quantile estimates, $p = (n-k)/(n+1)$,
$k = 5,\ldots,n-1$, see Figure 7.7(b). Note that the covariate dependent threshold
yields exceedances over the whole range of the covariate, whereas for the constant
threshold, exceedances exclusively occur at the higher pH levels. In Figure 7.8(a),
(b) and (c), we show the maximum likelihood estimates of respectively $\sigma$, $\beta_0$ and
$\beta_1$ for the above-described GP regression model fitted to exceedances over a constant
threshold (broken line) and the quantile regression threshold (solid line).
Finally, the profile log-likelihood function of $\beta_1$ using $k = 500$ exceedances over
the covariate dependent threshold and the corresponding 95% confidence interval
are shown in Figure 7.8(d). The 95% interval does not contain the value $\beta_1 = 0$,
so the hypothesis $H_0: \beta_1 = 0$ can be rejected at the 5% significance level.

Figure 7.7 Condroz data: exceedances over (a) the 20th largest response observation
and (b) regression quantile 0.9877 (corresponding to $k = 20$).


7.4.3 Goodness-of-fit 

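Similar to the discussion in section 7.2.3, we focus on the use of residual quantile
plots to assess the fit of a GP regression model. Consider

$$
Y_i \sim \mathrm{GP}(\sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i)),\qquad i=1,\ldots,n.
\tag{7.18}
$$

Since the exponential distribution is a special member of the GP family, it is natural
to apply a transformation to the exponential case. The transformation (Coles (2001))

$$
R_i=
\begin{cases}
\dfrac{1}{\gamma(\mathbf{x}_i)}\log\left(1+\gamma(\mathbf{x}_i)\,\dfrac{Y_i}{\sigma(\mathbf{x}_i)}\right), & \gamma(\mathbf{x}_i)\neq 0,\\[8pt]
\dfrac{Y_i}{\sigma(\mathbf{x}_i)}, & \gamma(\mathbf{x}_i)=0,
\end{cases}
\tag{7.19}
$$

results in a standard exponential random variable:

$$
\begin{aligned}
F_{R_i|\mathbf{x}_i}(r_i)
&=
\begin{cases}
H\left(\dfrac{\sigma(\mathbf{x}_i)}{\gamma(\mathbf{x}_i)}\left[\exp(\gamma(\mathbf{x}_i)r_i)-1\right];\ \sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i)\right), & \gamma(\mathbf{x}_i)\neq 0,\\[8pt]
H\left(\sigma(\mathbf{x}_i)r_i;\ \sigma(\mathbf{x}_i),\gamma(\mathbf{x}_i)\right), & \gamma(\mathbf{x}_i)=0,
\end{cases}\\
&=1-\exp(-r_i).
\end{aligned}
$$

If $Y_1,\ldots,Y_n$ are independent, then $R_1,\ldots,R_n$ are i.i.d. random variables, and
hence can be used to validate model (7.18), for instance, using an exponential
quantile plot

$$
\left(-\log\left(1-\frac{i}{n+1}\right),\ R_{i,n}\right),\qquad i=1,\ldots,n.
$$

When regression model (7.18) indeed gives an accurate description of the data,
we expect the points on the exponential quantile plot to scatter around the first
diagonal.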
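A compact Python sketch of the transformation (7.19) and of the exponential quantile plot coordinates is given below. The function names are hypothetical, and the fitted values of $\sigma(\mathbf{x}_i)$ and $\gamma(\mathbf{x}_i)$ are assumed to come from a previously fitted GP regression model.

```python
import math

def gp_residual(y, sigma, gamma):
    """Generalized residual (7.19) for an exceedance y > 0."""
    if abs(gamma) < 1e-8:                  # exponential case, gamma = 0
        return y / sigma
    return math.log(1.0 + gamma * y / sigma) / gamma

def exponential_qq_coordinates(residuals):
    """Pairs (-log(1 - i / (n + 1)), R_{i,n}) for the ordered residuals."""
    n = len(residuals)
    ordered = sorted(residuals)
    return [(-math.log(1.0 - i / (n + 1)), r)
            for i, r in enumerate(ordered, start=1)]
```

As a check on the transformation, applying `gp_residual` to the GP quantile at probability level $q$ should return the standard exponential quantile $-\log(1-q)$.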


Example 7.1 (continued) We evaluate the fit of the GP regression model with
$\gamma(pH) = \exp(\beta_0 + \beta_1 pH)$ to the 500 Ca exceedances of regression quantile
0.6667 using an exponential quantile plot. This plot is shown in Figure 7.9. The
ordered residuals scatter quite well around the first diagonal, giving evidence of a
good fit of the GP regression model to the Ca exceedances.

Figure 7.8 Condroz data: (a) $\hat{\sigma}$, (b) $\hat{\beta}_0$, (c) $\hat{\beta}_1$ as a function of $k$ for GP regression
model with constant threshold (broken line) and covariate dependent threshold

Figure 7.9 Condroz data: exponential quantile plot of the generalized residuals
(7.19) at $k = 500$.


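after replacing the unknown parameter functions by their respective maximum
likelihood estimates. In case the GP distribution was used as an approximation to
the conditional distribution of $Z - u_{\mathbf{x}}$ given $Z > u_{\mathbf{x}}$, then, on the basis of (C*),
setting $z := u_{\mathbf{x}} + y$, for $u_{\mathbf{x}} \to z^{*}$,

$$
\bar{F}_{Z|\mathbf{x}}(z)\sim
\begin{cases}
\bar{F}_{Z|\mathbf{x}}(u_{\mathbf{x}})\left(1+\gamma(\mathbf{x})\,\dfrac{z-u_{\mathbf{x}}}{\sigma(\mathbf{x})}\right)^{-1/\gamma(\mathbf{x})}, & \gamma(\mathbf{x})\neq 0,\\[8pt]
\bar{F}_{Z|\mathbf{x}}(u_{\mathbf{x}})\exp\left(-\dfrac{z-u_{\mathbf{x}}}{\sigma(\mathbf{x})}\right), & \gamma(\mathbf{x})=0.
\end{cases}
\tag{7.20}
$$

Setting the right-hand side of (7.20) equal to a tail probability $p$, solving for $z$ and
replacing the unknown quantities by their respective point estimates yields

$$
\hat{U}^{*}\left(\tfrac{1}{p};\mathbf{x}\right)=
\begin{cases}
u_{\mathbf{x}}+\dfrac{\hat{\sigma}(\mathbf{x})}{\hat{\gamma}(\mathbf{x})}\left[\left(\dfrac{\hat{\bar{F}}_{Z|\mathbf{x}}(u_{\mathbf{x}})}{p}\right)^{\hat{\gamma}(\mathbf{x})}-1\right], & \gamma(\mathbf{x})\neq 0,\\[8pt]
u_{\mathbf{x}}-\hat{\sigma}(\mathbf{x})\log\dfrac{p}{\hat{\bar{F}}_{Z|\mathbf{x}}(u_{\mathbf{x}})}, & \gamma(\mathbf{x})=0.
\end{cases}
$$

If the covariate dependent threshold is set at a non-extreme regression quantile
obtained with, for instance, the quantile regression methodology of Koenker and
Bassett (1978), that is, $u_{\mathbf{x}} = \hat{Q}\left(\frac{n-k}{n+1};\mathbf{x}\right)$, then the above expression reduces to

$$
\hat{U}^{*}\left(\tfrac{1}{p};\mathbf{x}\right)=
\begin{cases}
u_{\mathbf{x}}+\dfrac{\hat{\sigma}(\mathbf{x})}{\hat{\gamma}(\mathbf{x})}\left(\left(\dfrac{k+1}{(n+1)p}\right)^{\hat{\gamma}(\mathbf{x})}-1\right), & \gamma(\mathbf{x})\neq 0,\\[8pt]
u_{\mathbf{x}}-\hat{\sigma}(\mathbf{x})\log\dfrac{(n+1)p}{k+1}, & \gamma(\mathbf{x})=0.
\end{cases}
\tag{7.21}
$$

Figure 7.10 Condroz data: $\hat{U}^{*}(1000; pH)$ at $k = 500$ (solid line) and regression
quantile 0.6667 (broken line) as a function of pH.

Example 7.1 (continued) In Figure 7.10, we show the Ca versus pH scatterplot
together with $\hat{U}^{*}(1000; pH)$ (solid line) at $k = 500$. The broken line represents
the threshold that is here set to regression quantile 0.6667.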


7.5 Non-parametric Estimation 

The methods considered so far all require the specification of a functional form
for the model parameters. In practice, this often turns out to be a hard job. Moreover,
completely parametric models are often smoother than a visual inspection
of the data would suggest, and their lack of flexibility can lead to models with
large numbers of parameters still providing poor fits. As an alternative to the
parametric models discussed in the previous sections, the approach taken here is
non-parametric, that is, we let the data themselves describe the functional relationship
for the model parameters. In this section, we focus on modern smoothing
techniques such as maximum penalized likelihood estimation (Green and Silverman
(1994)) and local polynomial maximum likelihood estimation (Fan and Gijbels
(1996)) and combine these with the GP distribution as approximate model for
exceedances over high thresholds. Although we restrict the discussion to the GP
modelling of exceedances, the non-parametric procedures may be combined with
the GEV equally well. In this respect, some relevant references are Davison and
Ramesh (2000), Hall and Tajvidi (2000a) and Pauli and Coles (2001).

Consider a random variable $Z$ with associated covariate $x$ such that $F_{Z|x} \in \mathcal{D}(G_{\gamma(x)})$.
For simplicity of notation, we restrict the discussion to the single covariate
case. Following (C*), the conditional distribution of $Y = Z - u_x$ given $Z > u_x$
can be well approximated by the GP distribution, at least when the threshold $u_x$ is
set sufficiently high.


7.5.1 Maximum penalized likelihood estimation 

Given independent observations $Z_1,\ldots,Z_n$ and associated covariates $x_1 < \cdots < x_n$,
we fit a GP regression model to the exceedances $Y_j = Z_i - u_{x_i}$, provided
$Z_i > u_{x_i}$, $j = 1,\ldots,N_{u_x}$. Thereby, we do not impose a particular functional form
that describes how the parameters $\sigma$ and $\gamma$ depend on the covariate. Note that the
covariate is re-indexed in the sense that $x_i$ denotes the $x$ observation associated
with exceedance $Y_i$. Take $\sigma_i = \exp(s(x_i))$ and $\gamma_i = t(x_i)$ with $s$ and $t$ unknown
functions. The purpose of the notation is to stress that the $\sigma_i$ and $\gamma_i$ are parameters
whose similarity over the covariate space is determined by the smoothness of the
continuous functions $s$ and $t$, which are assumed to be twice differentiable over
$[x_1, x_n]$. The penalized log-likelihood function is defined as

$$
\Pi(s,t)=\sum_{i=1}^{N_{u_x}}\log g(Y_i;\exp(s(x_i)),t(x_i))-\frac{\lambda_1}{2}\int\left(s''(x)\right)^2dx-\frac{\lambda_2}{2}\int\left(t''(x)\right)^2dx,
\tag{7.22}
$$

where $g$ is the GP density function. The penalized log-likelihood function is a
difference of two terms. The first term is the classical log-likelihood function
for the GP distribution, the second term is a penalty function whose magnitude
reflects the integrated roughness of the functions $s$ and $t$. The amount of smoothing
is determined by the parameters $\lambda_1$ and $\lambda_2$. For small $\lambda_1$ and $\lambda_2$, the
(overparametrized) log-likelihood dominates $\Pi$, leading to estimates that follow the
data closely. Increasing $\lambda_1$ and $\lambda_2$ results in larger penalties and hence produces
smoother fits. Computing the maximum penalized likelihood estimates involves
the maximization of $\Pi$ over the entire functional space of $s$ and $t$. However, using
the fundamental theorems concerning natural cubic splines given in sections 2.1
and 2.2 of Green and Silverman (1994), it can be shown that the maximization of
(7.22) is equivalent to the maximization of a finite dimensional system corresponding
to the $\sigma_i$ and $\gamma_i$, $i = 1,\ldots,N_{u_x}$, followed by a cubic spline fit to construct the
complete $s$ and $t$ curves.

Using the notation of Green and Silverman (1994), define band matrices $Q$
and $R$ as follows. Let $h_i = x_{i+1} - x_i$, $i = 1,\ldots,N_{u_x}-1$. Define $Q$ as a $N_{u_x} \times (N_{u_x}-2)$
matrix with elements $q_{ij}$, $i = 1,\ldots,N_{u_x}$, $j = 2,\ldots,N_{u_x}-1$, given
by

$$
q_{ij}=
\begin{cases}
h_{j-1}^{-1}, & \text{if } i=j-1,\\
-h_{j-1}^{-1}-h_{j}^{-1}, & \text{if } i=j,\\
h_{j}^{-1}, & \text{if } i=j+1,\\
0, & \text{otherwise},
\end{cases}
$$

and $R$ as a $(N_{u_x}-2) \times (N_{u_x}-2)$ symmetric matrix with elements $r_{ij}$, $i, j = 2,\ldots,N_{u_x}-1$, given by

$$
r_{ij}=
\begin{cases}
\frac{1}{3}(h_{i-1}+h_{i}), & \text{if } i=j,\\
\frac{1}{6}h_{i}, & \text{if } i=j-1,\\
\frac{1}{6}h_{j}, & \text{if } i=j+1,\\
0, & \text{otherwise}.
\end{cases}
$$

Finally, define $K = QR^{-1}Q'$. Maximization of (7.22) with respect to the functions $s$ and $t$ is
equivalent to the maximization with respect to the vectors $\mathbf{s}$ and $\mathbf{t}$ of

$$
\Pi(\mathbf{s},\mathbf{t})=\sum_{i=1}^{N_{u_x}}\log g(Y_i;\exp(s(x_i)),t(x_i))-\frac{1}{2}\lambda_1\mathbf{s}'K\mathbf{s}-\frac{1}{2}\lambda_2\mathbf{t}'K\mathbf{t},
$$

where $\mathbf{s}' = (s(x_1),\ldots,s(x_{N_{u_x}}))$ and $\mathbf{t}' = (t(x_1),\ldots,t(x_{N_{u_x}}))$, followed by a cubic
spline fit to link the estimates together. The precision of the maximum penalized
likelihood estimators $\hat{\mathbf{s}}$ and $\hat{\mathbf{t}}$ can be assessed by the bootstrap (Chavez-Demoulin
(1999), Chavez-Demoulin and Davison (2001)) or on the basis of a Bayesian interpretation
of the penalized likelihood function (Wahba (1978), Green and Silverman
(1994), Pauli and Coles (2001)).
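The quadratic form $\mathbf{t}'K\mathbf{t}$ with $K = QR^{-1}Q'$ can be evaluated without forming $K$ explicitly: $Q'\mathbf{t}$ is a vector of differences of adjacent slopes, and $R$ is tridiagonal, so a single tridiagonal solve suffices. The Python sketch below follows the band matrix definitions of Green and Silverman (1994); the function name and the use of the Thomas algorithm for the solve are choices made here for illustration, and the covariate values are assumed to be strictly increasing.

```python
def roughness_penalty(x, t):
    """Return t' K t with K = Q R^{-1} Q' for knots x[0] < ... < x[n-1], n >= 3."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    # v = Q' t: difference of adjacent slopes at each interior knot.
    v = [(t[j + 1] - t[j]) / h[j] - (t[j] - t[j - 1]) / h[j - 1]
         for j in range(1, n - 1)]
    # Tridiagonal R: diagonal (h_{i-1} + h_i) / 3, off-diagonal h_i / 6.
    diag = [(h[j - 1] + h[j]) / 3.0 for j in range(1, n - 1)]
    off = [h[j] / 6.0 for j in range(1, n - 2)]
    m = n - 2
    # Thomas algorithm: solve R w = v by forward elimination, back substitution.
    c = [0.0] * m
    d = [0.0] * m
    c[0] = off[0] / diag[0] if m > 1 else 0.0
    d[0] = v[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - off[i - 1] * c[i - 1]
        if i < m - 1:
            c[i] = off[i] / denom
        d[i] = (v[i] - off[i - 1] * d[i - 1]) / denom
    w = [0.0] * m
    w[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        w[i] = d[i] - c[i] * w[i + 1]
    # t' K t = (Q' t)' R^{-1} (Q' t) = v' w
    return sum(vi * wi for vi, wi in zip(v, w))
```

For values lying on a straight line the penalty vanishes, in agreement with its interpretation as the integrated squared second derivative of the interpolating natural cubic spline.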


Example 7.1 (continued) In section 7.4, the Condroz data were analysed by fitting
the regression model GP$(\sigma, \exp(\beta_0 + \beta_1 pH))$ to the conditional Ca exceedances.
Here, maximum penalized likelihood estimation will be used to obtain a non-parametric
estimate of $\gamma(pH)$. Similar to the analysis in section 7.4, we take a
constant scale parameter $\sigma$. The maximum penalized likelihood estimates $\hat{\sigma}$ and $\hat{\mathbf{t}}$
are obtained by maximizing

$$
\Pi(\sigma,\mathbf{t})=\sum_{i=1}^{N_{u_{pH}}}\log g(Y_i;\sigma,t(pH_i))-\frac{1}{2}\lambda\,\mathbf{t}'K\mathbf{t}
$$

with respect to $\sigma$ and $\mathbf{t}$. Note that maximum penalized likelihood estimation easily
accommodates fully parametric specifications for some of the model parameters.
In Figure 7.11(a), we show the 50 exceedances over regression quantile 0.9673.
Figure 7.11(b) contains the maximum penalized likelihood estimates of $\gamma(pH)$
for three different values of $\lambda$ together with the estimate obtained from fitting the
parametric GP$(\sigma, \exp(\beta_0 + \beta_1 pH))$ regression model. The corresponding results
for the 200 exceedances over regression quantile 0.866 are shown in Figure 7.11(c)
and (d). Note how increasing the parameter $\lambda$, and hence increasing the penalty
assigned to roughness, leads to smoother $\gamma$ estimates.




Figure 7.11 Condroz data: (a) exceedances over regression quantile 0.9673
($k = 50$) versus pH and (b) maximum penalized likelihood estimates of $\gamma(pH)$
for $\lambda = 0.1$ (broken line), $\lambda = 0.01$ (broken-dotted line) and $\lambda = 0.001$ (dotted
line) together with the estimate obtained with the parametric regression model
GP($\sigma$, $\exp(\beta_0 + \beta_1 pH)$) (solid line). Panels (c) and (d) show the results obtained
with a threshold taken as regression quantile 0.866 ($k = 200$).

As with parametric GP regression modelling, the fit of the maximum
penalized likelihood estimate can be assessed by a visual inspection of the exponential
quantile plot of the generalized residuals (7.19). Non-parametric estimates
for extreme conditional quantiles can be obtained from (7.21), thereby replacing
$\sigma(x)$ and $\gamma(x)$ by their maximum penalized likelihood estimates.

Example 7.1 (continued) We evaluate the fit of the maximum penalized likelihood
estimation with $\lambda = 0.1$ at $k = 200$ by means of an exponential quantile plot of
the generalized residuals (7.19), see Figure 7.12(a). The residuals scatter quite well
around the first diagonal, indicating an appropriate fit of the Ca exceedances by the
maximum penalized likelihood estimates. In Figure 7.12(b), we show the Ca versus
pH scatterplot together with $\hat{U}^*(1000; pH)$ at $k = 200$. Here, $\hat{U}^*(1000; pH)$ is
obtained by plugging the maximum penalized likelihood estimates obtained with
$\lambda = 0.1$ into (7.21).


7.5.2 Local polynomial maximum likelihood estimation 

Alternatively, the parameter functions $\sigma$ and $\gamma$ can be estimated by repeated local
fits of the GP distribution. Consider independent random variables $Z_1, \ldots, Z_n$ and
associated covariate observations $x_1, \ldots, x_n$. Suppose we are interested in estimating
$\sigma$ and $\gamma$ at $x^*$. Fix a high local threshold $u_{x^*}$ and compute the exceedances
$Y_j = Z_i - u_{x^*}$, provided $Z_i > u_{x^*}$, $j = 1, \ldots, N_{u_{x^*}}$, where $i$ is the index of the $j$-th
exceedance in the original sample and $N_{u_{x^*}}$ denotes the number of exceedances
over the threshold $u_{x^*}$. Re-index the covariates such that
$x_i$ denotes the covariate observation associated with exceedance $Y_i$. Let $h$ denote
a bandwidth parameter. Since $\sigma$ and $\gamma$ are unknown, we approximate them by
polynomials centred at $x^*$. Indeed, assuming $\sigma$ and $\gamma$ are $p_1$ and $p_2$ times
differentiable respectively, we have, for $|x_i - x^*| \le h$,

\[
\sigma(x_i) = \sum_{j=0}^{p_1} \beta_{1j} (x_i - x^*)^j + o(h^{p_1}),
\]

\[
\gamma(x_i) = \sum_{j=0}^{p_2} \beta_{2j} (x_i - x^*)^j + o(h^{p_2}),
\]

where

\[
\beta_{1j} = \frac{1}{j!} \left. \frac{d^j \sigma(x_i)}{dx_i^j} \right|_{x_i = x^*}, \quad j = 0, \ldots, p_1,
\qquad
\beta_{2j} = \frac{1}{j!} \left. \frac{d^j \gamma(x_i)}{dx_i^j} \right|_{x_i = x^*}, \quad j = 0, \ldots, p_2.
\]

The coefficients of these approximations can be estimated by local
maximum likelihood fits of the GP distribution. Thereby, the contribution of the
observations to the log-likelihood is governed by a kernel function $K$, where $K$ is
such that observations close to $x^*$ receive more weight. Further, $K$ is assumed to be
a symmetric density function on $[-1, 1]$ and $h$ rescales $K$ as $K_h(x) = K(x/h)/h$.
Clearly, $h$ determines the amount of smoothing. The local polynomial maximum
likelihood estimator $(\hat{\beta}_1', \hat{\beta}_2') = (\hat{\beta}_{10}, \ldots, \hat{\beta}_{1p_1}, \hat{\beta}_{20}, \ldots, \hat{\beta}_{2p_2})$ is the maximizer of
the kernel weighted log-likelihood function

\[
L_{N_{u_{x^*}}}(\beta_1, \beta_2) = \sum_{i=1}^{N_{u_{x^*}}}
\log g\Bigl(Y_i; \sum_{j=0}^{p_1} \beta_{1j}(x_i - x^*)^j, \sum_{j=0}^{p_2} \beta_{2j}(x_i - x^*)^j\Bigr) K_h(x_i - x^*)
\]

with respect to $(\beta_1', \beta_2') = (\beta_{10}, \ldots, \beta_{1p_1}, \beta_{20}, \ldots, \beta_{2p_2})$, where $g$ denotes the GP
density. Note that local polynomial fitting provides estimates of $\gamma(x^*)$ and $\sigma(x^*)$
and their derivatives up to order $p_2$ and $p_1$ respectively. Beirlant and Goegebeur
(2004a) proved consistency and asymptotic normality of the local polynomial maximum
likelihood estimator in case $\gamma(x) > 0$.

Figure 7.13 Condroz data: local polynomial maximum likelihood estimates of
$\gamma(pH)$ obtained with a normal kernel function, $p_1 = 0$, $p_2 = 1$, $k = 75$ and $h =
0.3$ (solid line), $h = 0.5$ (broken line), $h = 0.7$ (broken-dotted line).
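The kernel weighted log-likelihood can be evaluated in a few lines. The sketch below is only an illustration under stated assumptions: it uses the Epanechnikov kernel as one example of a symmetric density on $[-1, 1]$, the function names are ours, and the maximization itself is left to a numerical optimizer of the reader's choice.

```python
import math

def gp_logpdf(y, sigma, gamma):
    """Log-density of the GP distribution at an exceedance y >= 0."""
    if sigma <= 0.0 or y < 0.0:
        return -math.inf
    if gamma == 0.0:
        return -math.log(sigma) - y / sigma      # exponential limit
    z = 1.0 + gamma * y / sigma
    if z <= 0.0:
        return -math.inf                          # y outside the support
    return -math.log(sigma) - (1.0 / gamma + 1.0) * math.log(z)

def epanechnikov(u):
    """Symmetric density on [-1, 1] (one possible choice for K)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def poly(coeffs, dx):
    """Evaluate sum_j coeffs[j] * dx**j."""
    return sum(c * dx ** j for j, c in enumerate(coeffs))

def local_loglik(beta1, beta2, y, x, x_star, h, kernel=epanechnikov):
    """Kernel weighted GP log-likelihood around x_star.

    beta1 and beta2 hold the polynomial coefficients of sigma and gamma.
    """
    total = 0.0
    for yi, xi in zip(y, x):
        dx = xi - x_star
        weight = kernel(dx / h) / h               # K_h(x_i - x*)
        if weight == 0.0:
            continue                              # outside the window
        total += weight * gp_logpdf(yi, poly(beta1, dx), poly(beta2, dx))
    return total
```

With `beta1 = [sigma]` and `beta2 = [gamma]` (that is, $p_1 = p_2 = 0$), the function reduces to a locally constant GP fit in which each exceedance is weighted by its distance to $x^*$.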

Example 7.1 (continued) In Figure 7.13, we show the local polynomial maximum
likelihood estimates of $\gamma(pH)$ obtained with the above-described procedure with a
normal kernel function, $p_1 = 0$, $p_2 = 1$ and $h = 0.3$ (solid line), $h = 0.5$ (broken
line) and $h = 0.7$ (broken-dotted line). In this analysis, we set the local threshold
at the 76th largest response observation within each window, so $k = 75$.

Consistent with this local approach, Hall and Tajvidi (2000a) proposed to use
local quantile plots as a basis for the goodness-of-fit evaluation. Consider a window
centred at $x^*$ with length $2h$ and let $(Y_1', x_1'), \ldots, (Y_k', x_k')$ denote the observations
$(Y_i, x_i)$ for which $x_i \in [x^* - h, x^* + h]$. Given a local polynomial fit, we transform
$(Y_1', x_1'), \ldots, (Y_k', x_k')$ into generalized residuals (7.19), thereby replacing the
unknown parameter functions by their polynomial approximation, and use these
to construct an exponential quantile plot. Non-parametric estimates of extreme
quantiles of $F_{Z|x^*}$ can be obtained from

\[
\hat{U}^*(1/p; x^*) = u_{x^*} + \frac{\hat{\sigma}(x^*)}{\hat{\gamma}(x^*)}
\left[ \left( \frac{n^* p}{k} \right)^{-\hat{\gamma}(x^*)} - 1 \right]
\]

with $n^*$ the number of observations in $[x^* - h, x^* + h]$, $k$ the number of
exceedances receiving positive weight and $\hat{\sigma}(x^*)$ and $\hat{\gamma}(x^*)$ denoting the local
polynomial maximum likelihood estimates for $\sigma(x^*)$ and $\gamma(x^*)$ respectively.
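As a small numerical illustration, the sketch below evaluates an extreme quantile from a local GP fit. It assumes the usual GP-based form $u + \sigma\{(n^* p / k)^{-\gamma} - 1\}/\gamma$, with the limit $u - \sigma \log(n^* p / k)$ for $\gamma = 0$; the function name and the input values in the test are hypothetical and not taken from the Condroz analysis.

```python
import math

def gp_quantile(u, sigma, gamma, n, k, p):
    """Quantile exceeded with probability p, from a GP fit above threshold u.

    n is the number of observations in the window and k the number of
    exceedances, so k / n estimates the probability of exceeding u.
    """
    ratio = n * p / k
    if gamma == 0.0:
        return u - sigma * math.log(ratio)
    return u + sigma * (ratio ** (-gamma) - 1.0) / gamma
```

Setting $p = k/n^*$ returns the threshold itself, which is a convenient sanity check on any implementation.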

Example 7.1 (continued) We evaluate the local polynomial fit of the GP distribution
with $h = 0.5$ and $k = 75$ at $pH^* = 7$ using a local exponential quantile plot
of the generalized residuals, see Figure 7.14(a). In Figure 7.14(b), we show the
threshold (broken line), which is set here at the 76th largest response observation
within each window of length $2h$, and $\hat{U}^*(1000; pH)$ obtained with $h = 0.5$ and
$k = 75$ as a function of pH.


7.6 Case Study 

Insurance companies often use reinsurance contracts to safeguard themselves
against portfolio contaminations caused by extreme claims. In an excess-of-loss
reinsurance contract, the reinsurer pays for the claim amount in excess of a given
retention. For the reinsurer, an accurate description of the upper tail of the claim
size distribution is of crucial importance for competitive price setting. In this process,
taking covariate information into account makes it possible to differentiate premiums
according to the risks involved.

In this section, we illustrate how parametric and non-parametric GP modelling
of conditional exceedances may help in describing tails of conditional claim-size
distributions. Consider the AoN Re Belgium fire portfolio data introduced in
section 1.3.3. In Figure 7.15(a), we show the claim size versus log(sum insured)
($\log(SI)$) scatterplot of claims generated by the office buildings portfolio. Given
this point cloud with some really large claims for $7 < \log(SI) < 10$, we propose
to use the covariate dependent threshold

\[
\log(u_{SI}) = \theta_{0,p} + \theta_{1,p} \log(SI) + \theta_{2,p} \log^2(SI)
\]

where $\theta_{0,p}$, $\theta_{1,p}$ and $\theta_{2,p}$ ($0 < p < 1$) are estimated using the quantile regression
methodology of Koenker and Bassett (1978). Figure 7.15(b) contains the regression
quantiles $p = 0.9116$ (60 exceedances) and $p = 0.7875$ (150 exceedances). The
exceedances over these regression quantiles are shown in Figure 7.15(c) and (d).
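Once the threshold coefficients are available, extracting the exceedances is mechanical. The following sketch assumes that the coefficients have already been estimated by quantile regression (the fitting step is not shown) and that exceedances are measured on the original claim scale; the function names and the coefficient values used in the test are hypothetical.

```python
import math

def log_threshold(log_si, theta):
    """Quadratic threshold on the log scale: theta0 + theta1*x + theta2*x^2."""
    t0, t1, t2 = theta
    return t0 + t1 * log_si + t2 * log_si ** 2

def exceedances(claims, sums_insured, theta):
    """Return (log(SI), claim - u_SI) for every claim above its threshold."""
    out = []
    for claim, si in zip(claims, sums_insured):
        x = math.log(si)
        u = math.exp(log_threshold(x, theta))   # threshold on the claim scale
        if claim > u:
            out.append((x, claim - u))
    return out
```

Each returned pair keeps the covariate value next to the exceedance, which is the form needed for the GP regression fits that follow.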

A GP($\sigma(SI)$, $\gamma(SI)$) regression model with $\log(\sigma(SI)) = \beta_{1,0} + \beta_{1,1} \log(SI)
+ \beta_{1,2} \log^2(SI)$ and $\log(\gamma(SI)) = \beta_{2,0} + \beta_{2,1} \log(SI) + \beta_{2,2} \log^2(SI)$ is fitted to
both sets of exceedances. In Table 7.1 and Table 7.2, we show the resulting parameter
estimates together with the value of the log-likelihood function for the full model
and some reduced models. The reduced models are obtained by sequentially removing
non-significant parameters. Significance is decided upon by performing a classical
likelihood ratio test. For instance, at $k = 60$, the likelihood ratio test statistic
for testing the hypothesis $H_0 : \beta_{1,2} = 0$ equals $2(143.8726 - 143.8677) = 0.0098$.
This value is smaller than the critical value $\chi^2_1(0.95) = 3.8415$ and hence $H_0$ cannot
be rejected at significance level $\alpha = 0.05$. In this way, removing non-significant
parameters one by one, we finally obtain a GP regression model with $\log(\sigma(SI)) =
\beta_{1,0}$ and $\log(\gamma(SI)) = \beta_{2,0} + \beta_{2,1} \log(SI) + \beta_{2,2} \log^2(SI)$ (model III) at $k = 60$
and with $\log(\sigma(SI)) = \beta_{1,0} + \beta_{1,1} \log(SI) + \beta_{1,2} \log^2(SI)$ and $\log(\gamma(SI)) = \beta_{2,0}$
(model III) at $k = 150$. The final parameter functions are shown in Figure 7.16.
At $k = 150$, the tail dependence on the covariate SI is modelled through the scale
parameter of the GP distribution while at $k = 60$, that is, deeper in the conditional
tails, tail dependence goes through the extreme value index. We evaluate the fit of
both GP regression models based on an exponential quantile plot of the generalized
residuals (7.19), see Figure 7.17. Both plots show residuals that scatter quite well
around the first diagonal, indicating a reasonable fit of the respective GP regression
models.

Figure 7.15 AoN Re Belgium data: (a) scatterplot of claim size versus $\log(SI)$,
(b) log(claim) versus $\log(SI)$ with regression quantile 0.7875 (broken line) and
regression quantile 0.9116 (solid line) superimposed, (c) exceedances over regression
quantile 0.9116 ($k = 60$) and (d) exceedances over regression quantile 0.7875
($k = 150$).

Table 7.1 AoN Re Belgium data: GP modelling at $k = 60$.

Model  $\hat\beta_{1,0}$  $\hat\beta_{1,1}$  $\hat\beta_{1,2}$  $\hat\beta_{2,0}$  $\hat\beta_{2,1}$  $\hat\beta_{2,2}$  $\log L$
I        -4.4069           -0.0564            0.0080           -35.7814            8.7077           -0.5238           143.8726
II       -5.0653            0.0927            0                -35.9648            8.7633           -0.5280           143.8677
III      -4.2476            0                 0                -35.7165            8.6223           -0.5150           143.5624
IV       -4.2806            0                 0                  0.3177           -0.0479            0                141.3641

Table 7.2 AoN Re Belgium data: GP modelling at $k = 150$.

Model  $\hat\beta_{1,0}$  $\hat\beta_{1,1}$  $\hat\beta_{1,2}$  $\hat\beta_{2,0}$  $\hat\beta_{2,1}$  $\hat\beta_{2,2}$  $\log L$
I       -13.0059            1.9757           -0.1262            -1.8591            0.3018           -0.0078           513.6545
II      -13.3194            2.0426           -0.1297            -1.1827            0.1538            0                513.6464
III     -14.4144            2.2034           -0.1330             0.1181            0                 0                512.6212
IV       -4.1386           -0.1666            0                  0.1091            0                 0                509.9120
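The sequence of likelihood ratio comparisons behind Tables 7.1 and 7.2 can be retraced with a few lines of code. This is a sketch only: the log-likelihood values and the critical value 3.8415 are copied from the text, while the function names and the stepwise rule (stop at the first rejected simplification, each step removing one parameter) are our reading of the procedure described above.

```python
CHI2_1_095 = 3.8415  # 0.95 quantile of the chi-square(1) law, as quoted in the text

# Log-likelihood values from Table 7.1 (k = 60) and Table 7.2 (k = 150)
LOGLIK = {
    60: {"I": 143.8726, "II": 143.8677, "III": 143.5624, "IV": 141.3641},
    150: {"I": 513.6545, "II": 513.6464, "III": 512.6212, "IV": 509.9120},
}

def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio statistic for a reduced model nested in a fuller one."""
    return 2.0 * (loglik_full - loglik_reduced)

def select_model(logliks, order=("I", "II", "III", "IV"), critical=CHI2_1_095):
    """Accept successive one-parameter reductions until one is rejected."""
    current = order[0]
    for candidate in order[1:]:
        if lr_statistic(logliks[current], logliks[candidate]) >= critical:
            break                      # the extra restriction is rejected
        current = candidate
    return current
```

For both thresholds the step from model III to model IV is the first one whose statistic exceeds the critical value, which is why model III is retained in each case.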

Generally, the estimation of the extreme value index is not a goal in itself and
is often performed as an intermediate step when ultimate interest is in extreme
conditional quantiles or small conditional exceedance probabilities. Likewise, the
primary interest of a reinsurer will not be in the extreme value index estimates
but rather in the claim level that will be exceeded only once in, say, 1000 claims,
thereby taking into account the possible influence of covariate information. In
Figure 7.18, we show the claim size versus SI scatterplot with the estimated 0.995
conditional quantile at $k = 60$ (solid line) and $k = 150$ (broken line) superimposed.
At $k = 60$, the extreme value index was found to vary significantly with SI. This is
reflected here in extreme conditional quantile estimates that follow the data better
than the estimates obtained at $k = 150$.
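To see how the covariate enters the tail at $k = 60$, one can evaluate the fitted extreme value index function of model III in Table 7.1. The sketch below uses the tabulated estimates; the excess of the 0.995 conditional quantile over the threshold is computed under the assumption that the conditional probability of exceeding the regression quantile 0.9116 equals $1 - 0.9116$ and that the GP approximation holds above it. Function names are ours.

```python
import math

# Model III at k = 60 (Table 7.1): constant scale, quadratic log-index
BETA2_K60 = (-35.7165, 8.6223, -0.5150)
SIGMA_K60 = math.exp(-4.2476)

def gamma_k60(log_si):
    """Fitted extreme value index as a function of log(SI)."""
    b0, b1, b2 = BETA2_K60
    return math.exp(b0 + b1 * log_si + b2 * log_si ** 2)

def excess_quantile(log_si, tail_prob=0.005, exceed_prob=1.0 - 0.9116):
    """Excess over the threshold of the conditional (1 - tail_prob) quantile."""
    g = gamma_k60(log_si)
    return SIGMA_K60 * ((tail_prob / exceed_prob) ** (-g) - 1.0) / g

# The quadratic exponent is maximal at -b1 / (2 * b2)
peak_log_si = -BETA2_K60[1] / (2.0 * BETA2_K60[2])
```

The fitted index is largest near $\log(SI) \approx 8.37$, inside the range $7 < \log(SI) < 10$ where the largest claims were observed.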

As a final step, we analyse the AoN Re Belgium claim data in a non-parametric
way. We restrict the non-parametric analysis to the 60 exceedances over regression
quantile 0.9116 and fit a GP($\sigma$, $\gamma(SI)$) regression model using maximum
penalized likelihood estimation. Figure 7.19 contains the maximum penalized likelihood
estimates of $\gamma(SI)$ for $\lambda = 0.1$ (broken line), $\lambda = 0.05$ (broken-dotted line),
$\lambda = 0.01$ (dotted line) and the parametric estimate obtained before (solid line).
Note that increasing $\lambda$, and hence giving more weight to the roughness penalty,
produces smoother fits. Besides providing an alternative estimation procedure,
the penalized likelihood estimates strongly confirm the previously performed completely
parametric analysis. We evaluate the fit of the non-parametric estimate on
the basis of the exponential quantile plot of the generalized residuals (7.19), see
Figure 7.20(a). Non-parametric estimates of $\hat{U}^*(200; SI)$ can be obtained from
(7.21), thereby replacing the unknown parameters by the maximum penalized
likelihood estimates. In Figure 7.20(b), we show the claim size versus $\log(SI)$ scatterplot
with $\hat{U}^*(200; SI)$ for the three values of $\lambda$ considered before superimposed.
Again, the non-parametric analysis confirms the previously obtained completely
parametric results.

Figure 7.18 AoN Re Belgium data: $\hat{U}^*(200; SI)$ at $k = 60$ (solid line) and $k =
150$ (broken line) as a function of $\log(SI)$.

Figure 7.19 AoN Re Belgium data: maximum penalized likelihood estimates of
$\gamma(SI)$ obtained at $k = 60$ for $\lambda = 0.1$ (broken line), $\lambda = 0.05$ (broken-dotted line)
and $\lambda = 0.01$ (dotted line) together with the estimate obtained with the parametric
regression model GP($\sigma$, $\exp(\beta_0 + \beta_1 \log(SI) + \beta_2 \log^2(SI))$) (solid line).

Figure 7.20 AoN Re Belgium data: maximum penalized likelihood estimation
at $k = 60$: (a) exponential quantile plot of the generalized residuals (7.19) with
$\lambda = 0.05$ and (b) $\hat{U}^*(200; SI)$ for $\lambda = 0.1$ (broken line), $\lambda = 0.05$ (broken-dotted
line), $\lambda = 0.01$ (dotted line) and parametric estimate (solid line).



8 

MULTIVARIATE EXTREME VALUE THEORY


8.1 Introduction 

Many problems involving extreme events are inherently multivariate. Gumbel and
Goldstein (1964) already investigated the maximum annual discharges of the Ocmulgee
River in Georgia at two different stations located upstream and downstream.
Coles and Tawn (1996a) and Schlather and Tawn (2003) undertake a spatial analysis
of daily rainfall extremes in south-west England in the context of risk assessment
for hydrological structures such as reservoirs, river flood networks and drainage
systems. De Haan and de Ronde (1998) and de Haan and Sinha (1999) estimate
the probability that a storm will cause a certain sea-dike near the town of Petten,
the Netherlands, to collapse because of a dangerous combination of sea level
and wave height. In a financial context, Starica (1999) analyses the occurrence
of joint extreme returns in pairs of exchange rates of various European currencies
(in the pre-euro era) versus the US dollar, while Longin and Solnik (2001)
investigate the dependence between international equity markets in periods of high
volatility. Surprisingly, multivariate techniques also come into play in the analysis
of univariate time series, for instance, in the construction in Smith et al. (1997)
of a Markov model for the extremes of a series of daily minimum temperatures
recorded at Wooster, Ohio. These and many other examples demonstrate the need
for statistical methods for analysing extremes of multivariate data.

At the very first attempt to imagine what a statistical methodology for multivariate
extremes could look like, we stumble upon a fundamental difficulty: what
exactly makes a multivariate observation 'extreme'? Is it sufficient that just a single
coordinate attains an exceptional value, or should it be extreme in all dimensions
simultaneously? More technically, what meaning should be attached in a multivariate
setting to concepts such as order statistics, sample maximum, tail quantiles and threshold
exceedances, which are all so useful in univariate extreme value statistics? The
answers to these questions may depend on the situation at hand.

A fundamentally new issue that arises when there is more than one variable
is that of dependence. How do extremes in one variable relate to those in another
one? What are the possible dependence structures? And how can they be estimated? As
in the univariate case, one of the aims of the exercise is to extrapolate outside the
range of the data. When more than one variable is involved, we can only hope to
do so reliably if we take proper account of the possibility that extremes in several
coordinates occur jointly.

The study of multivariate extremes, then, splits apart into two components: the
marginal distributions and the dependence structure. This distinction is reflected in
both theory and practice. Typically, first, the margins are dealt with, and second,
after a transformation standardizing the margins to a common scale, the dependence.
The first step merely involves the univariate techniques developed in the
previous chapters. The second step, however, is new.

We will discover that the class of possible limiting dependence structures cannot 
be captured in a finite-dimensional parametric family. This is a major setback in 
comparison to the univariate case, where we could rely on parametric techniques 
based on the GEV or GP distributions. This time, we will either have to shift to 
non-parametric techniques or construct sensible parametric models. 

There exists a great variety of equivalent descriptions of extreme value dependence
structures, and although each of them has its own merits, this multitude of
sometimes seemingly unconnected approaches may cause confusion and hamper
the flow from theory to practice. It is one of the aims of this text to fit the pieces
together and give the reader a panoramic view of the state of the art. Some new
insights and results form a pleasant by-product of this unification exercise.

The text on multivariate extremes is divided into two chapters. In the present 
chapter, we explore the probability theory on extremes of a sample of independent, 
identically distributed random vectors. This forms the necessary preparation for 
the statistical methodology in the next chapter. All in all, the material is vast, 
and a complete coverage of the literature would have filled a book by itself. The 
interested reader may find further reading in, for instance, Galambos (1978, 1987), 
Resnick (1987), Kotz and Nadarajah (2000), Coles (2001), Drees (2001), Reiss 
and Thomas (2001), Fougeres (2004), and of course the many papers cited in the 
next two chapters. 

The outline of the present chapter is as follows. In the remainder of this introduction,
we formulate the multivariate version of the domain-of-attraction problem,
which, as in the univariate case, is a convenient starting point. In section 8.2, we
study multivariate extreme value distributions, focusing mainly on their dependence
structure, while section 8.3 describes their domains of attraction. Some
additional topics are briefly touched upon in section 8.4. Section 8.5 summarizes
the essential things to know before attacking the statistical issues in Chapter 9.
Finally, the appendix (section 8.6) includes, amongst others, a directory of formulas
connecting the various equivalent descriptions of multivariate extreme value
distributions.


The multivariate domain-of-attraction problem 


The road from univariate to multivariate extreme value theory is immediately confronted
with an obstacle: there is no obvious way to order multivariate observations.
Barnett (1976) considers no fewer than four different categories of order relations
for multivariate data, each being of potential use. The most useful order relation
in multivariate extreme value theory is a special case of what is called marginal
ordering: for $d$-dimensional vectors $x = (x_1, \ldots, x_d)$ and $y = (y_1, \ldots, y_d)$, the
relation $x \le y$ is defined as $x_j \le y_j$ for all $j = 1, \ldots, d$. Unlike in one dimension,
not every two vectors can be ordered in this way: imagine two bivariate
vectors, one at the upper-left of the other. The component-wise maximum of $x$
and $y$, defined as

\[
x \vee y := (x_1 \vee y_1, \ldots, x_d \vee y_d),
\]

is in general different from both $x$ and $y$.

Consider a sample of $d$-dimensional observations, $X_i = (X_{i,1}, \ldots, X_{i,d})$ for
$i = 1, \ldots, n$. The sample maximum, $M_n$, is now defined as the vector of
component-wise maxima, that is, the components of $M_n = \bigvee_{i=1}^n X_i$ are given by

\[
M_{n,j} = \bigvee_{i=1}^n X_{i,j}, \qquad j = 1, \ldots, d.
\]

Observe that the sample maximum need not be a sample point; in this sense, the
definition might seem artificial. Still, from its study a rich theory emanates that
leads to a broad set of statistical tools for analysing extremes of multivariate data.
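A tiny numerical illustration may help fix ideas. The sketch below, with function names and sample values of our own choosing, implements the marginal ordering and the component-wise maximum, and exhibits both phenomena mentioned above: two vectors that cannot be ordered, and a sample maximum that is not a sample point.

```python
def leq(x, y):
    """Marginal ordering: x <= y if and only if x_j <= y_j for every j."""
    return all(a <= b for a, b in zip(x, y))

def componentwise_max(*vectors):
    """Component-wise maximum of one or more vectors of equal dimension."""
    return tuple(max(column) for column in zip(*vectors))

sample = [(1.0, 3.0), (2.0, 1.0), (0.5, 0.5)]
sample_max = componentwise_max(*sample)
```

Here the first two observations are not comparable, since one lies to the upper-left of the other, and the sample maximum combines the first coordinate of one observation with the second coordinate of another.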

Of course, we could just as well study the component-wise minimum rather
than the maximum. But clearly, just as in the univariate case, results for one of
the two can be immediately transferred to the other through the relation

\[
\bigwedge_{i=1}^n X_i = -\bigvee_{i=1}^n (-X_i).
\]

Therefore, we can, without loss of generality, focus on maxima alone. Notation
will be greatly simplified if we adopt the following convention: unless mentioned
otherwise, all operations and order relations between vectors are understood to be
component-wise. Observe that we have already employed this convention in the
definitions above of '$\le$' and '$\vee$' for vectors.

The distribution function of the component-wise maximum, $M_n$, of an independent
sample $X_1, \ldots, X_n$ from a distribution function $F$ is given by

\[
P[M_n \le x] = P[X_1 \le x, \ldots, X_n \le x] = F^n(x), \qquad x \in \mathbb{R}^d.
\]

As in the univariate case, we will need to normalize $M_n$ in some way in order
to get a non-trivial limit distribution as the sample size tends to infinity. The
domain-of-attraction problem then reads as follows: find sequences of vectors,
$(a_n)_n$ and $(b_n)_n$, where $a_n > 0 = (0, \ldots, 0)$, such that $a_n^{-1}(M_n - b_n)$ converges
in distribution to a non-degenerate limit, that is, such that there exists a $d$-variate
distribution function $G$ with non-degenerate margins such that

\[
F^n(a_n x + b_n) \overset{D}{\to} G(x), \qquad n \to \infty. \tag{8.1}
\]

If (8.1) holds, we say that $F$ is in the (max-)domain of attraction of $G$, notation $F \in
D(G)$. Moreover, $G$ is called a (multivariate) extreme value distribution function.

The study of equation (8.1) then splits into two parts: (i) characterize the class
of extreme value distribution functions, and (ii) for a given extreme value distribution
function, characterize its domain of attraction. We will take up these parts
separately in the next two sections.

Before we start off, there is a simple but crucial observation to be made. Let
$F_j$ and $G_j$ denote the $j$th marginal distribution functions of $F$ and $G$ respectively.
Recall that, by assumption, $G_j$ is non-degenerate. Since a sequence of random
vectors can only converge in distribution if the corresponding marginal sequences
do, we obtain for $j = 1, \ldots, d$,

\[
F_j^n(a_{n,j} x_j + b_{n,j}) \overset{D}{\to} G_j(x_j), \qquad n \to \infty.
\]

Therefore, each $G_j$ is by itself a univariate extreme value distribution function and
$F_j$ is in its domain of attraction. This has been extensively studied in Chapter 2. In
the present chapter, then, we can focus on the dependence structures of $F$ and $G$.

A final remark: since the marginal distributions of $G$ are continuous, $G$ itself
is continuous, so that the convergence in (8.1) holds not only in distribution but
also for every $x \in [-\infty, \infty]$, and even uniformly.


8.2 Multivariate Extreme Value Distributions 

Unlike the univariate case, multivariate extreme value distributions cannot be represented
as a parametric family indexed by a finite-dimensional parameter vector.
The reason is that the class of dependence structures is too large. Instead, the family
of multivariate extreme value distributions is indexed, for instance, by a class
of convex functions, or, in another description, by a class of finite measures.

8.2.1 Max-stability and max-infinite divisibility 

Let us start from equation (8.1). From the theory on univariate extremes in
Chapter 2, we know that for positive integer $k$, there exist vectors $\alpha_k > 0$ and
$\beta_k$ such that $a_n^{-1} a_{nk} \to \alpha_k$ and $a_n^{-1}(b_{nk} - b_n) \to \beta_k$ as $n \to \infty$. But since, for
positive integer $k$ and $x \in \mathbb{R}^d$, also $F^{nk}(a_{nk} x + b_{nk}) \to G(x)$ as $n \to \infty$ as well
as $a_{nk} x + b_{nk} = a_n \{a_n^{-1} a_{nk} x + a_n^{-1}(b_{nk} - b_n)\} + b_n$, we obtain

\[
G^k(\alpha_k x + \beta_k) = G(x), \qquad x \in \mathbb{R}^d. \tag{8.2}
\]

A $d$-variate distribution function $G$ such that for every positive integer $k$ we
can find vectors $\alpha_k > 0$ and $\beta_k$ such that (8.2) holds is called max-stable. The
meaning is the same as in the univariate case: if $Y, Y_1, Y_2, \ldots$ are independent
random vectors with distribution function $G$, then

\[
\alpha_k^{-1} \Bigl( \bigvee_{i=1}^k Y_i - \beta_k \Bigr) \overset{D}{=} Y, \qquad k = 1, 2, \ldots
\]

Clearly, a max-stable distribution function is in its own domain of attraction;
in particular, it must be an extreme value distribution function. This argument
together with the previous paragraph shows that the classes of extreme value and
max-stable distribution functions actually coincide.

A consequence of (8.2) is that $G^{1/k}$ is a distribution function for every positive
integer $k$, that is, $G$ is max-infinitely divisible (Balkema and Resnick 1977). In
particular, there exists a measure, $\mu$, on $[-\infty, \infty)$, such that

\[
G(x) = \exp\{-\mu([-\infty, \infty) \setminus [-\infty, x])\}, \qquad x \in [-\infty, \infty], \tag{8.3}
\]

whence the name exponent measure.

The exponent measure, $\mu$, is in general not unique, and in the future we will
always use the following particular choice. For $j = 1, \ldots, d$, let $q_j$ be the lower
end-point of the $j$th margin, $G_j$, of $G$, that is, $q_j = \inf\{x \in \mathbb{R} : G_j(x) > 0\}$. Define
$q = (q_1, \ldots, q_d)$. Then, as $G(x) = 0$ for $x \not\ge q$, there exists an exponent measure
$\mu$ that is concentrated on $[q, \infty) \setminus \{q\}$. Moreover, there is only one such exponent
measure.


8.2.2 Exponent measure 

Reduction to standard Fréchet margins


To study the dependence structure of a max-stable distribution, it is convenient
to standardize the margins so that they are all the same. The precise choice of
marginal distribution itself is not so important. Still, a particularly useful choice is
that of standard Fréchet margins, as in that case the exponent measure must satisfy
a useful homogeneity property. Connections with other choices for the margins
employed in the literature are discussed in section 8.2.6.

Let $G_j^{\leftarrow}$ denote the quantile function of $G_j$, the $j$th margin of the max-stable
distribution function $G$, that is, $G_j^{\leftarrow}(p) = x$ if and only if $G_j(x) = p$, where
$0 < p < 1$. Observe that if, in the usual parametrization,

\[
G_j(x_j) = \exp\left\{ -\left( 1 + \gamma_j \frac{x_j - \mu_j}{\sigma_j} \right)_+^{-1/\gamma_j} \right\}, \qquad x_j \in \mathbb{R}, \tag{8.4}
\]

for some $\gamma_j \in \mathbb{R}$, $\mu_j \in \mathbb{R}$ and $\sigma_j > 0$, then

\[
G_j^{\leftarrow}(e^{-1/z_j}) = \mu_j + \sigma_j \frac{z_j^{\gamma_j} - 1}{\gamma_j}, \qquad 0 < z_j < \infty,
\]

with appropriate interpretations if $\gamma_j = 0$.

Let $Y$ be a random vector with distribution function $G$, and let $G_*$ be the
distribution function of $(-1/\log G_1(Y_1), \ldots, -1/\log G_d(Y_d))$, that is,

\[
G_*(z) = G\{G_1^{\leftarrow}(e^{-1/z_1}), \ldots, G_d^{\leftarrow}(e^{-1/z_d})\}, \qquad z \in (0, \infty). \tag{8.5}
\]

Then the margins of $G_*$ are standard Fréchet, as $P[-1/\log G_j(Y_j) \le z] = e^{-1/z}$
for $0 < z < \infty$. Conversely,

\[
G(x) = G_*\{-1/\log G_1(x_1), \ldots, -1/\log G_d(x_d)\}, \qquad x \in \mathbb{R}^d, \tag{8.6}
\]

where, taking appropriate limits, $-1/\log(0) := 0$ and $-1/\log(1) := \infty$.
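The marginal standardization is easy to carry out explicitly for a GEV margin. The following sketch, with function names of our own, maps a point $x$ in the interior of the support of a margin with parameters $(\mu, \sigma, \gamma)$ to the standard Fréchet scale through $z = -1/\log G_j(x)$ and back through the quantile formula given above; the Gumbel case $\gamma = 0$ is handled as the usual limit.

```python
import math

def to_frechet(x, mu, sigma, gamma):
    """z = -1 / log G_j(x) for a GEV margin, for x inside its support."""
    if gamma == 0.0:
        return math.exp((x - mu) / sigma)
    base = 1.0 + gamma * (x - mu) / sigma
    if base <= 0.0:
        raise ValueError("x lies outside the support of the GEV margin")
    return base ** (1.0 / gamma)

def from_frechet(z, mu, sigma, gamma):
    """Inverse map: the GEV quantile at probability exp(-1/z), 0 < z < inf."""
    if gamma == 0.0:
        return mu + sigma * math.log(z)
    return mu + sigma * (z ** gamma - 1.0) / gamma
```

Applying `to_frechet` coordinate by coordinate to a max-stable vector with these margins yields a vector with standard Fréchet margins, so only the dependence structure remains to be described.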

Not only does $G_*$ have max-stable margins, it is itself max-stable as well: since
$G_j^k(\alpha_{k,j} x_j + \beta_{k,j}) = G_j(x_j)$ for every $j = 1, \ldots, d$, $k = 1, 2, \ldots$ and $x_j \in \mathbb{R}$, it
follows that

\[
G_*^k(kz) = G_*(z), \qquad z \in \mathbb{R}^d; \ k = 1, 2, \ldots
\]

In particular, $G_*^k(kz) = G_*^m(mz)$ for arbitrary positive integers $k$ and $m$, and thus
$G_*^r(rz) = G_*(z)$ for all positive rational $r$. By continuity,

\[
G_*^s(sz) = G_*(z), \qquad z \in \mathbb{R}^d; \ 0 < s < \infty. \tag{8.7}
\]

An extreme value distribution (function) with standard Fréchet margins is sometimes
called simple.
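Relation (8.7) can be checked numerically on a concrete simple distribution. As an illustration we take the logistic model, $G_*(z) = \exp\{-(\sum_j z_j^{-1/\alpha})^{\alpha}\}$ with $0 < \alpha \le 1$, a standard parametric example that is not introduced in this section and is used here only as an assumption for the sketch.

```python
import math

def g_star_logistic(z, alpha):
    """Logistic simple max-stable distribution function, 0 < alpha <= 1."""
    v = sum(zj ** (-1.0 / alpha) for zj in z) ** alpha
    return math.exp(-v)

def check_homogeneity(z, alpha, s):
    """Return G_*(s z)^s - G_*(z), which (8.7) says should vanish."""
    scaled = tuple(s * zj for zj in z)
    return g_star_logistic(scaled, alpha) ** s - g_star_logistic(z, alpha)
```

Letting all but one coordinate tend to infinity recovers the standard Fréchet margin $e^{-1/z_j}$, in line with the definition of a simple extreme value distribution.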

Exponent measure 

Let $\mu_*$ be an exponent measure of the simple extreme value distribution function
$G_*$. Without loss of generality, we can assume that $\mu_*$ is concentrated on $[0, \infty) \setminus
\{0\}$, so that

\[
V_*(z) := -\log G_*(z) = \mu_*([0, \infty) \setminus [0, z]), \qquad z \in [0, \infty]. \tag{8.8}
\]

Observe that $V_*(z) = \infty$ as soon as $z_j = 0$ for some $j = 1, \ldots, d$. Also, since the
margins of $G_*$ are standard Fréchet,

\[
V_*(\infty, \ldots, \infty, z_j, \infty, \ldots, \infty) = \mu_*(\{x \in [0, \infty) : x_j > z_j\}) = z_j^{-1} \tag{8.9}
\]

for all $j = 1, \ldots, d$ and $0 < z_j < \infty$.

The exponent measures $\mu$ and $\mu_*$ of $G$ and $G_*$ are related in the following
way. For $x \in [q, \infty]$ and $z \in [0, \infty]$ related by $z_j = -1/\log G_j(x_j)$,

\[
\mu([q, \infty) \setminus [q, x]) = -\log G(x) = -\log G_*(z) = \mu_*([0, \infty) \setminus [0, z]). \tag{8.10}
\]

Equation (8.7) now implies

\[
s \mu_*(s([0, \infty) \setminus [0, z])) = \mu_*([0, \infty) \setminus [0, z]), \qquad z \in [0, \infty); \ 0 < s < \infty.
\]

By a measure-theoretic argument, this homogeneity relation actually holds for all
Borel subsets of $[0, \infty) \setminus \{0\}$, that is,

\[
s \mu_*(s \, \cdot) = \mu_*(\cdot), \qquad 0 < s < \infty. \tag{8.11}
\]

Stable tail dependence function 

The stable tail dependence function is defined by

\[
l(v) = V_*(1/v_1, \ldots, 1/v_d)
     = \mu_*([0, \infty] \setminus [0, (1/v_1, \ldots, 1/v_d)]), \qquad v \in [0, \infty] \tag{8.12}
\]

(Huang 1992). In terms of the original max-stable distribution function $G$, it is
given by

\[
l(v) = -\log G\{G_1^{\leftarrow}(e^{-v_1}), \ldots, G_d^{\leftarrow}(e^{-v_d})\}, \qquad v \in [0, \infty]. \tag{8.13}
\]

Conversely, we can reconstruct a max-stable distribution function $G$ from its margins
$G_j$ and its stable tail dependence function $l$ through

\[
-\log G(x) = l\{-\log G_1(x_1), \ldots, -\log G_d(x_d)\}, \qquad x \in \mathbb{R}^d. \tag{8.14}
\]

By (8.9) and (8.11), a stable tail dependence function $l$ has the following
properties:

(L1) $l(s \, \cdot) = s\, l(\cdot)$ for $0 < s < \infty$;

(L2) $l(e_j) = 1$ for $j = 1, \ldots, d$, where $e_j$ is the $j$th unit vector in $\mathbb{R}^d$;

(L3) $v_1 \vee \cdots \vee v_d \le l(v) \le v_1 + \cdots + v_d$ for $v \in [0, \infty)$.

The upper and lower bounds in (L3) are themselves valid stable tail dependence functions:
the lower bound, $l(v) = v_1 \vee \cdots \vee v_d$, corresponds to complete dependence,
$G(x) = G_1(x_1) \wedge \cdots \wedge G_d(x_d)$, whereas the upper bound, $l(v) = v_1 + \cdots + v_d$,
corresponds to independence, $G(x) = G_1(x_1) \cdots G_d(x_d)$. Moreover, from (8.23)
below it follows that

(L4) $l$ is convex, that is, $l[\lambda v + (1 - \lambda) w] \le \lambda l(v) + (1 - \lambda) l(w)$ for $\lambda \in [0, 1]$.

Note that, except for the bivariate case, properties (L1) to (L4) do not characterize
the class of stable tail dependence functions, that is, a function $l$ that
satisfies (L1) to (L4) is not necessarily a stable tail dependence function. As a
counter-example in the trivariate case, put $l(v_1, v_2, v_3) = (v_1 + v_2) \vee (v_2 + v_3) \vee
(v_3 + v_1)$. Properties (L1) to (L4) are clearly fulfilled. Still, $l$ cannot be a stable tail
dependence function, because $l(1, 1, 0) = l(1, 0, 1) = l(0, 1, 1) = 2$ would imply
pairwise independence and thus, as we will see in section 8.2.4, full independence,
in contradiction with $l(1, 1, 1) = 2 \ne 3$.
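The counter-example is easily verified numerically. The sketch below, with function names of our own, checks (L1) to (L3) on a small grid (a spot check, not a proof) and evaluates the candidate at the points used in the argument; (L4) is not tested, but it holds because a maximum of linear functions is convex.

```python
from itertools import product

def l_candidate(v):
    """The trivariate counter-example: maximum of the pairwise sums."""
    v1, v2, v3 = v
    return max(v1 + v2, v2 + v3, v3 + v1)

def l_independence(v):
    return sum(v)

def l_complete_dependence(v):
    return max(v)

def satisfies_l1_to_l3(l, grid=(0.0, 0.5, 1.0, 2.0), scales=(0.5, 3.0), tol=1e-12):
    """Check (L1), (L2) and (L3) for a trivariate function on a finite grid."""
    unit_vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    if any(abs(l(e) - 1.0) > tol for e in unit_vectors):            # (L2)
        return False
    for v in product(grid, repeat=3):
        if not (max(v) - tol <= l(v) <= sum(v) + tol):               # (L3)
            return False
        for s in scales:
            if abs(l(tuple(s * c for c in v)) - s * l(v)) > tol:     # (L1)
                return False
    return True
```

The candidate agrees with the independence function on every pair of coordinates but not at $(1, 1, 1)$, which is exactly the contradiction described above.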
8.2.3 Spectral measure

The homogeneity property (8.11) of the exponent measure $\mu_*$ yields a versatile
representation in terms of (pseudo-)polar coordinates. We start from two
arbitrary norms, $\|\cdot\|_1$ and $\|\cdot\|_2$, on $\mathbb{R}^d$. Typical choices include the $L_p$-norms
$\|x\| = (|x_1|^p + \cdots + |x_d|^p)^{1/p}$ for $1 \le p < \infty$ or the max-norm $\|x\| =
\max(|x_1|, \ldots, |x_d|)$, see below. Let $S_2 = \{\omega \in \mathbb{R}^d : \|\omega\|_2 = 1\}$ be the unit sphere
with respect to the norm $\|\cdot\|_2$. Define the mapping $T$ from $\mathbb{R}^d \setminus \{0\}$ to $(0, \infty) \times
S_2$ by

\[
T(z) = (r, \omega), \quad \text{where } r = \|z\|_1 \text{ and } \omega = z/\|z\|_2, \tag{8.15}
\]

that is, $r$ is the radial part and $\omega$ the angular part of $z$. Observe that $T$ is one-to-one
and onto, because $T(z) = (r, \omega)$ if and only if $z = r\omega/\|\omega\|_1 = T^{-1}(r, \omega)$.
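The mapping $T$ and its inverse take only a few lines of code. In the sketch below we choose, purely as an example, the sum-norm for $\|\cdot\|_1$ and the Euclidean norm for $\|\cdot\|_2$; any other pair of norms can be passed in instead, and the function names are ours.

```python
import math

def norm_sum(z):
    return sum(abs(c) for c in z)

def norm_euclid(z):
    return math.sqrt(sum(c * c for c in z))

def to_polar(z, norm1=norm_sum, norm2=norm_euclid):
    """T(z) = (r, omega) with r = ||z||_1 and omega = z / ||z||_2."""
    r = norm1(z)
    n2 = norm2(z)
    return r, tuple(c / n2 for c in z)

def from_polar(r, omega, norm1=norm_sum):
    """Inverse map: z = r * omega / ||omega||_1."""
    n1 = norm1(omega)
    return tuple(r * c / n1 for c in omega)
```

Note that the radius is measured in the first norm while the angle is normalized in the second, which is why the inverse rescales $\omega$ by $\|\omega\|_1$ rather than simply multiplying by $r$.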

Now define a measure, $S$, on $\Xi = S_2 \cap [0, \infty)$ by

\[
S(B) = \mu_*(\{z \in [0, \infty) : \|z\|_1 \ge 1, \ z/\|z\|_2 \in B\}) \tag{8.16}
\]

for Borel subsets $B$ of $\Xi$. The measure $S$ is called the spectral measure. It is
determined uniquely by the exponent measure $\mu_*$ and the chosen norms through (8.16)
and (8.19) below. The homogeneity of $\mu_*$ expressed in (8.11) implies

\[
\mu_*(\{z \in [0, \infty) : \|z\|_1 \ge r, \ z/\|z\|_2 \in B\}) = r^{-1} S(B)
\]

for $0 < r < \infty$ and Borel subsets $B$ of $\Xi$. The interpretation is that in polar coordinates
$(r, \omega)$, the exponent measure $\mu_*$ factors into a product of two measures,
one in the radial coordinate that is always equal to $r^{-2}\,dr$, and one in the angular
coordinate, equal to the spectral measure $S$. This property is usually written as

\[
\mu_* \circ T^{-1}(dr, d\omega) = r^{-2}\,dr\, S(d\omega), \tag{8.17}
\]

which is called the spectral decomposition of the exponent measure. It is essentially
due to de Haan and Resnick (1977), who considered the special case where the
two norms are equal to the Euclidean norm.

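The spectral decomposition (8.17) can be used to calculate the integral of a
real-valued function $g$ on $[0, \infty) \setminus \{0\}$ with respect to $\mu_*$ by

\[ \int_{[0,\infty) \setminus \{0\}} g(z)\,\mu_*(dz)
   = \int_\Xi \int_0^\infty g(r\omega/\|\omega\|_1)\, r^{-2}\,dr\, S(d\omega)
   = \int_\Xi \int_0^\infty g(r\omega)\, r^{-2}\,dr\, \|\omega\|_1^{-1}\, S(d\omega). \tag{8.18} \]

Conversely, for a real-valued, $S$-integrable function $f$ on $\Xi$, we have

\[ \int_\Xi f(\omega)\, S(d\omega) = \int_{\{z \in [0,\infty) : \|z\|_1 > 1\}} f(z/\|z\|_2)\, \mu_*(dz). \tag{8.19} \]

Combining (8.8) and (8.18), we can write $V_* = -\log G_*$ in terms of the spectral
measure $S$:

\[ V_*(z) = \int_{[0,\infty) \setminus \{0\}} 1\Bigl( \bigvee_{j=1}^d \frac{x_j}{z_j} > 1 \Bigr) \mu_*(dx)
   = \int_\Xi \bigvee_{j=1}^d \Bigl( \frac{\omega_j}{\|\omega\|_1} \frac{1}{z_j} \Bigr) S(d\omega),
   \qquad z \in [0, \infty]. \tag{8.20} \]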

The requirement that the margins of $G_*$ are standard Fréchet is equivalent to

\[ \int_\Xi \frac{\omega_j}{\|\omega\|_1}\, S(d\omega) = 1, \qquad j = 1, \ldots, d. \tag{8.21} \]

Conversely, any positive measure $S$ on $\Xi$ satisfying (8.21) is the spectral measure
of the $d$-variate extreme value distribution $G_* = \exp(-V_*)$ given by (8.20). In
terms of the original max-stable distribution function $G$, we find, combining (8.6)
and (8.20),

\[ \log G(x) = \int_\Xi \bigwedge_{j=1}^d \Bigl\{ \frac{\omega_j}{\|\omega\|_1} \log G_j(x_j) \Bigr\} S(d\omega),
   \qquad x \in \mathbb{R}^d, \tag{8.22} \]

with the convention $\log(0) = -\infty$. In case the two norms are equal, the
previous formulas simplify slightly as $\|\omega\|_1 = 1$ for $\omega \in \Xi$. Finally, combining
(8.12) with (8.20) yields

\[ l(v) = \int_\Xi \bigvee_{j=1}^d \Bigl( \frac{\omega_j}{\|\omega\|_1} v_j \Bigr) S(d\omega),
   \qquad v \in [0, \infty]. \tag{8.23} \]

A useful consequence of (8.23) is that the stable tail dependence function $l$ is
convex.


Independence and complete dependence 

Two interesting special cases are those of independence and complete dependence.
Let $G$ be a multivariate extreme value distribution with spectral measure $S$ as in
(8.22). Let $e_j$ denote the $j$th unit vector in $\mathbb{R}^d$, that is, the $j$th coordinate of $e_j$ is
one and all other coordinates are zero. Then

\[ G(x) = \prod_{j=1}^d G_j(x_j), \qquad x \in \mathbb{R}^d, \]

that is, the margins of $G$ are independent, if and only if $S$ consists of point masses
of size $\|e_j\|_1$ at the points $e_j/\|e_j\|_2$, that is, if

\[ \int_\Xi f(\omega)\, S(d\omega) = \sum_{j=1}^d \|e_j\|_1\, f(e_j/\|e_j\|_2), \]

for any real-valued, $S$-integrable function $f$ on $\Xi$. On the other hand, let the point
$\boldsymbol{\omega}_0 = (\omega_0, \ldots, \omega_0)$ be the intersection of $\Xi$ and the line
$\{x \in \mathbb{R}^d : x_1 = \cdots = x_d\}$. Then

\[ G(x) = \bigwedge_{j=1}^d G_j(x_j), \qquad x \in \mathbb{R}^d, \]

that is, the margins of $G$ are completely dependent, if and only if $S$ collapses to a
single point mass of size $\|\boldsymbol{\omega}_0\|_1/\omega_0$ at the point $\boldsymbol{\omega}_0$, that is, if

\[ \int_\Xi f(\omega)\, S(d\omega) = \frac{\|\boldsymbol{\omega}_0\|_1}{\omega_0}\, f(\boldsymbol{\omega}_0), \]

for any real-valued, $S$-integrable function $f$ on $\Xi$.


Special cases 

We specialize the spectral decomposition (8.20) for a number of choices of the two
norms, $\|\cdot\|_1$ and $\|\cdot\|_2$.

Sum-norm. The most popular choice for the two norms $\|\cdot\|_1$ and $\|\cdot\|_2$ is the
sum-norm, $\|x\| = |x_1| + \cdots + |x_d|$. In that case, the spectral measure $S$ is typically
denoted by $H$, and the space it is defined on, $\Xi$, is equal to the unit simplex,

\[ S_d = \{\omega \in [0, \infty) : \omega_1 + \cdots + \omega_d = 1\}. \]

Representations (8.23) and (8.22) become

\[ l(v) = \int_{S_d} \bigvee_{j=1}^d (\omega_j v_j)\, H(d\omega), \qquad v \in [0, \infty], \tag{8.24} \]

\[ \log G(x) = \int_{S_d} \bigwedge_{j=1}^d \{\omega_j \log G_j(x_j)\}\, H(d\omega), \qquad x \in \mathbb{R}^d, \tag{8.25} \]

and the requirement (8.21) on $H$ reads

\[ \int_{S_d} \omega_j\, H(d\omega) = 1, \qquad j = 1, \ldots, d. \tag{8.26} \]

In particular, the total mass of $H$ is always $H(S_d) = d$. By (8.16), the measure $H$
can be obtained from the exponent measure $\mu_*$ through

\[ H(B) = \mu_*(\{z \in [0, \infty) : z_1 + \cdots + z_d > 1,\ (z_1 + \cdots + z_d)^{-1} z \in B\}) \tag{8.27} \]

for Borel subsets $B$ of $S_d$. Independence occurs if and only if $H$ consists of
unit point masses at the vertices $e_1, \ldots, e_d$ of the simplex $S_d$, while complete
dependence occurs if and only if $H$ consists of a single point mass of size $d$ at the
centre point $(1/d, \ldots, 1/d)$. Representation (8.25) already appears, without proof,
in Galambos (1978).
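For a spectral measure $H$ made up of finitely many point masses, the integral in (8.24) reduces to a finite sum, which makes the two extreme cases easy to verify numerically. The sketch below is illustrative only; representing $H$ as a list of `(omega, mass)` pairs and the names `tail_dependence` and `satisfies_moment_constraint` are assumptions made here, not notation from the text.

```python
def tail_dependence(v, atoms):
    """l(v) = sum of mass * max_j(omega_j * v_j) for a discrete H, cf. (8.24)."""
    return sum(m * max(w_j * v_j for w_j, v_j in zip(w, v)) for w, m in atoms)

def satisfies_moment_constraint(atoms, tol=1e-12):
    """Check the requirement (8.26): the integral of omega_j against H equals 1 for each j."""
    d = len(atoms[0][0])
    return all(abs(sum(m * w[j] for w, m in atoms) - 1.0) <= tol for j in range(d))

d = 3
# Independence: unit point masses at the vertices e_1, ..., e_d.
independence = [(tuple(1.0 if i == j else 0.0 for i in range(d)), 1.0) for j in range(d)]
# Complete dependence: a single point mass of size d at the centre (1/d, ..., 1/d).
complete = [((1.0 / d,) * d, float(d))]

v = (2.0, 0.5, 1.0)
l_indep = tail_dependence(v, independence)   # equals v_1 + v_2 + v_3
l_compl = tail_dependence(v, complete)       # equals max(v_1, v_2, v_3)
```

The two results reproduce the upper and lower bounds of the stable tail dependence function, namely the sum and the maximum of the coordinates of `v`.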

In the bivariate case, $d = 2$, the unit simplex $S_2$ is usually identified with the
unit interval $[0, 1]$ by identifying $(\omega, 1 - \omega)$ with $\omega$. The spectral measure $H$ is
then defined on $[0, 1]$ and is given by

\[ H([0, \omega]) = \mu_*(\{(z_1, z_2) \in [0, \infty)^2 : z_1 + z_2 > 1,\ z_1/(z_1 + z_2) \le \omega\}), \tag{8.28} \]

for $\omega \in [0, 1]$. The constraints on $H$ are

\[ \int_{[0,1]} \omega\, H(d\omega) = 1 = \int_{[0,1]} (1 - \omega)\, H(d\omega). \tag{8.29} \]

The stable tail dependence function is given by

\[ l(v_1, v_2) = \int_{[0,1]} (\omega v_1) \vee \{(1 - \omega) v_2\}\, H(d\omega), \qquad (v_1, v_2) \in [0, \infty]^2. \tag{8.30} \]

Euclidean norm. Mainly in the bivariate case, other choices for the two norms
have been considered as well. If, as originally in de Haan and Resnick (1977),
both norms are equal to the Euclidean norm, $\|(x_1, x_2)\| = (|x_1|^2 + |x_2|^2)^{1/2}$, then
by (8.22) and (8.23), identifying $\omega = (\cos\theta, \sin\theta)$ with $\theta \in [0, \pi/2]$,

\[ l(v_1, v_2) = \int_{[0,\pi/2]} \{\cos(\theta) v_1\} \vee \{\sin(\theta) v_2\}\, S(d\theta), \]

\[ \log G(x_1, x_2) = \int_{[0,\pi/2]} \{\cos(\theta) \log G_1(x_1)\} \wedge \{\sin(\theta) \log G_2(x_2)\}\, S(d\theta), \]

where $S$ is a finite measure on $[0, \pi/2]$ such that

\[ \int_{[0,\pi/2]} \cos(\theta)\, S(d\theta) = 1 = \int_{[0,\pi/2]} \sin(\theta)\, S(d\theta), \]

see (8.21). By (8.16), the spectral measure $S$ and the exponent measure $\mu_*$ are
related through

\[ S([0, \theta]) = \mu_*(\{(z_1, z_2) \in [0, \infty)^2 : z_1^2 + z_2^2 > 1,\ z_2/z_1 \le \tan(\theta)\}), \tag{8.31} \]

for $\theta \in [0, \pi/2]$. Independence occurs if and only if $S$ puts unit point masses at
the end-points $0$ and $\pi/2$. On the other hand, complete dependence occurs if and
only if $S$ puts a single point mass of size $\sqrt{2}$ at the mid-point $\pi/4$.

Max-norm and Euclidean norm. Einmahl et al. (1997) set the first norm equal to
the max-norm, $\|(x_1, x_2)\|_1 = \max(|x_1|, |x_2|)$, and the second norm to the Euclidean
norm, $\|(x_1, x_2)\|_2 = (|x_1|^2 + |x_2|^2)^{1/2}$. Identifying again $\omega = (\cos\theta, \sin\theta)$ with
$\theta \in [0, \pi/2]$, we find from (8.23)

\[ l(v_1, v_2) = \int_{[0,\pi/2]} \{\cot(\theta \vee \pi/4)\, v_1\} \vee \{\tan(\theta \wedge \pi/4)\, v_2\}\, S(d\theta) \tag{8.32} \]

as well as from (8.22)

\[ \log G(x_1, x_2) = \int_{[0,\pi/2]} \{\cot(\theta \vee \pi/4) \log G_1(x_1)\} \wedge \{\tan(\theta \wedge \pi/4) \log G_2(x_2)\}\, S(d\theta), \]

where $S$ is a finite measure on $[0, \pi/2]$ such that

\[ \int_{[0,\pi/2]} \cot(\theta \vee \pi/4)\, S(d\theta) = 1 = \int_{[0,\pi/2]} \tan(\theta \wedge \pi/4)\, S(d\theta), \]

see (8.21). The relation between the spectral measure $S$ and the exponent measure
$\mu_*$ is that

\[ S([0, \theta]) = \mu_*(\{(z_1, z_2) \in [0, \infty)^2 : z_1 \vee z_2 > 1,\ z_2/z_1 \le \tan(\theta)\}), \tag{8.33} \]

for $\theta \in [0, \pi/2]$. Independence occurs if and only if $S$ puts unit point masses at
the end-points $0$ and $\pi/2$. On the other hand, complete dependence occurs if and
only if $S$ degenerates to a unit point mass at the mid-point $\pi/4$.

The spectral measure $S$ considered in (8.32) is often connected to an alternative
exponent measure directly related to $l$. Let $\psi$ denote the transformation of $[0, \infty]^2$
into itself given by $\psi(v_1, v_2) = (1/v_1, 1/v_2)$. Consider the measure $\nu = \mu_* \circ \psi$ on
the space $(0, \infty]^2 \setminus \{(\infty, \infty)\}$. By (8.8) and (8.12),

\[ l(v_1, v_2) = \nu\bigl((0, \infty]^2 \setminus ([v_1, \infty] \times [v_2, \infty])\bigr), \qquad (v_1, v_2) \in [0, \infty]^2. \]

The measure $\nu$ is sometimes called the exponent measure as well. By (8.11), it
satisfies the homogeneity property

\[ \nu(s\,\cdot\,) = s\,\nu(\,\cdot\,), \qquad 0 < s < \infty. \]

The spectral measure $S$ of (8.32) can be found from $\nu$ through

\[ S([0, \theta]) = \nu(\{(v_1, v_2) \in (0, \infty]^2 : v_1 \wedge v_2 < 1,\ v_1/v_2 \le \tan(\theta)\}), \]

for $\theta \in [0, \pi/2]$. Independence occurs if and only if $\nu$ is concentrated on the lines
through infinity $\{(v_1, \infty) : 0 < v_1 < \infty\}$ and $\{(\infty, v_2) : 0 < v_2 < \infty\}$, whereas
complete dependence occurs if and only if $\nu$ is concentrated on the diagonal.


Spectral densities 

Consider the spectral measure $H$ of (8.27) on the unit simplex
$S_d = \{\omega \in [0, \infty) : \omega_1 + \cdots + \omega_d = 1\}$ for $d \ge 2$. If $G_*$ is absolutely continuous,
then we may reconstruct the densities of $H$ from the derivatives of the function
$V_* = -\log G_*$. We say “densities” and not “density” because, in general, $H$ may have
a density on the interior of $S_d$ and also on each of the lower-dimensional subspaces of $S_d$.

More specifically, the unit simplex $S_d$ is partitioned in a natural way into
so-called faces, with dimensions ranging from $0$ (the $d$ vertices) up to $d - 1$ (the
interior of $S_d$). In particular, for a non-empty subset $a$ of $\{1, \ldots, d\}$, define

\[ S_{d,a} = \{\omega \in S_d : \omega_j > 0 \text{ if } j \in a;\ \omega_j = 0 \text{ if } j \notin a\}. \]

For instance, if $d = 2$, we have

\[ S_{2,\{1\}} = \{(1, 0)\}, \qquad S_{2,\{2\}} = \{(0, 1)\}, \qquad S_{2,\{1,2\}} = \{(\omega, 1 - \omega) : 0 < \omega < 1\}. \]

In $d = 3$ dimensions, we obtain three vertices, three edges, and the interior of the
triangle. For general $d$, the sets $S_{d,a}$ partition $S_d$ into $2^d - 1$ subsets.
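The face structure of the simplex is purely combinatorial, so it can be enumerated directly. The following sketch is illustrative; the function names `faces` and `face_of` are hypothetical. It lists the index sets $a$ and finds the face containing a given point of the simplex.

```python
from itertools import combinations

def faces(d):
    """All non-empty index subsets a of {1, ..., d}; each labels one face S_{d,a}."""
    idx = range(1, d + 1)
    return [a for size in range(1, d + 1) for a in combinations(idx, size)]

def face_of(omega):
    """Index set a of the face containing a point omega of the unit simplex."""
    return tuple(j for j, w in enumerate(omega, start=1) if w > 0)

all_faces = faces(3)
# Group the faces of the triangle by their number of indices.
by_size = {k: [a for a in all_faces if len(a) == k] for k in (1, 2, 3)}
example = face_of((0.5, 0.0, 0.5))   # a point on the edge between e_1 and e_3
```

For $d = 3$ this yields seven faces: three vertices, three edges and the interior, in line with the count $2^d - 1$.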

Now let us consider the restriction of the spectral measure $H$ to the face $S_{d,a}$.
First, if $a$ is a singleton, $\{j\}$, then $S_{d,a}$ is just the vertex $\{e_j\}$, the $j$th unit vector
in $\mathbb{R}^d$. Even if $G_*$ is absolutely continuous, the spectral measure $H$ may still
assign positive mass to these vertices; for instance, when the margins of $G_*$ are
independent, $H(\{e_j\}) = 1$ for all $j = 1, \ldots, d$. Let us denote this mass by
$h_a = h_a(e_j)$, to be thought of as the density of $H$ with respect to the unit point mass at $e_j$.

Next, let $a$ be a subset of $\{1, \ldots, d\}$ with $|a|$, the number of elements of $a$,
at least two. Clearly, the number of free variables in $S_{d,a}$ is $k = |a| - 1$. Now,
assume that the spectral measure $H$ has a density $h_a$ on $S_{d,a}$, the latter set being
identified with the open region of $\mathbb{R}^k$ defined by

\[ \Delta_k = \{u \in (0, \infty)^k : u_1 + \cdots + u_k < 1\}. \]

More precisely, integrals over $S_{d,a}$ with respect to $H$ may be calculated through

\[ \int_{S_{d,a}} f(\omega)\, H(d\omega) = \int_{\Delta_k} f\{I_a(u)\}\, h_a(u)\, du_1 \cdots du_k; \]

here $I_a$ is the map identifying $u$ in $\Delta_k$ with $I_a(u) = \omega$ in $S_{d,a}$, that is, if
$a = \{j_1, \ldots, j_{k+1}\}$, then $\omega_{j_i} = u_i$ ($i = 1, \ldots, k$), $\omega_{j_{k+1}} = 1 - (u_1 + \cdots + u_k)$, and
$\omega_j = 0$ for $j \notin a$.

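Coles and Tawn (1991) found a way to compute the spectral densities $h_a$ from
the partial derivatives of $V_*$. For $a = \{j_1, \ldots, j_m\} \subset \{1, \ldots, d\}$ and $(z_j)_{j \in a}$ such
that $0 < z_j < \infty$, we have

\[ \lim_{z_j \to 0,\ j \notin a} \frac{\partial^m V_*}{\partial z_{j_1} \cdots \partial z_{j_m}}(z)
   = -r^{-(m+1)}\, h_a\Bigl( \frac{z_{j_1}}{r}, \ldots, \frac{z_{j_{m-1}}}{r} \Bigr),
   \qquad \text{where } r = \sum_{j \in a} z_j. \tag{8.34} \]

A new proof of (8.34) is given in section 8.6.1.

It is useful to spell out (8.34) explicitly in the bivariate case and to rewrite it in
terms of the stable tail dependence function, $l(v_1, v_2) = V_*(1/v_1, 1/v_2)$. As usual,
we identify $(\omega, 1 - \omega)$ in $S_2$ with $\omega$ in $[0, 1]$. The point masses of $H$ on $0$ and
$1$ are

\[ H(\{0\}) = -\lim_{z_1 \to 0} z_2^2\, \frac{\partial V_*}{\partial z_2}(z_1, z_2) = \lim_{v_1 \to \infty} \frac{\partial l}{\partial v_2}(v_1, v_2), \]

\[ H(\{1\}) = -\lim_{z_2 \to 0} z_1^2\, \frac{\partial V_*}{\partial z_1}(z_1, z_2) = \lim_{v_2 \to \infty} \frac{\partial l}{\partial v_1}(v_1, v_2), \]

while its density on the interior of the unit interval is, for $0 < \omega < 1$,

\[ h(\omega) = -\frac{\partial^2 V_*}{\partial z_1 \partial z_2}(\omega, 1 - \omega) \tag{8.35} \]

\[ \phantom{h(\omega)} = -\{\omega(1 - \omega)\}^{-1}\, \frac{\partial^2 l}{\partial v_1 \partial v_2}(1 - \omega, \omega). \tag{8.36} \]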

Example 8.1 The bivariate asymmetric logistic model (Tawn 1988a) is given by
its stable tail dependence function

\[ l(v_1, v_2) = (1 - \psi_1) v_1 + (1 - \psi_2) v_2 + \{(\psi_1 v_1)^{1/\alpha} + (\psi_2 v_2)^{1/\alpha}\}^{\alpha}. \]

Here $0 < \alpha \le 1$ and $0 \le \psi_j \le 1$ for $j = 1, 2$. Computing the partial derivatives of
$l$ and applying (8.35) and (8.36), we find $H(\{0\}) = 1 - \psi_2$, $H(\{1\}) = 1 - \psi_1$, and

\[ h(\omega) = (\alpha^{-1} - 1)(\psi_1 \psi_2)^{1/\alpha} \{\omega(1 - \omega)\}^{1/\alpha - 2}
   \bigl[ \{\psi_1 (1 - \omega)\}^{1/\alpha} + (\psi_2 \omega)^{1/\alpha} \bigr]^{\alpha - 2} \tag{8.37} \]

for $0 < \omega < 1$.
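A quick numerical sanity check of Example 8.1 is to confirm that the point masses and the interior density together satisfy the two moment constraints of the bivariate spectral measure, namely that $\omega$ and $1 - \omega$ both integrate to one against $H$. The sketch below uses a plain midpoint rule; the parameter values ($\alpha = 0.4$, $\psi_1 = 0.7$, $\psi_2 = 0.5$) and the function names are arbitrary choices made for illustration.

```python
def spectral_density(w, alpha, psi1, psi2):
    """Interior density h(w) of the asymmetric logistic model, cf. (8.37)."""
    inner = (psi1 * (1.0 - w)) ** (1.0 / alpha) + (psi2 * w) ** (1.0 / alpha)
    return ((1.0 / alpha - 1.0) * (psi1 * psi2) ** (1.0 / alpha)
            * (w * (1.0 - w)) ** (1.0 / alpha - 2.0) * inner ** (alpha - 2.0))

def moments(alpha, psi1, psi2, n=200_000):
    """Midpoint-rule approximation of the integrals of w and (1 - w) against H."""
    mass_at_0, mass_at_1 = 1.0 - psi2, 1.0 - psi1
    int_w = int_1mw = 0.0
    for i in range(n):
        w = (i + 0.5) / n
        h = spectral_density(w, alpha, psi1, psi2)
        int_w += w * h / n
        int_1mw += (1.0 - w) * h / n
    # The atom at 1 contributes to the integral of w, the atom at 0 to that of 1 - w.
    return int_w + mass_at_1, int_1mw + mass_at_0

m_w, m_1mw = moments(alpha=0.4, psi1=0.7, psi2=0.5)
```

Both returned values should be close to one. With $\alpha \le 1/2$ the density is bounded near the end-points, which is why a simple midpoint rule is adequate here; for $\alpha$ closer to one the integrable end-point singularities would call for a more careful quadrature.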


Change of norms 


The spectral measure $S$ of an exponent measure $\mu_*$ depends on the choice of
norms $\|\cdot\|_i$ ($i = 1, 2$). This choice is basically a matter of convenience, and in
the literature, different authors use different norms, see above. Still, the transition
from one choice of norms to another only involves a simple formula.

Let $\|\cdot\|_i$ and $\|\cdot\|'_i$ ($i = 1, 2$) be four norms on $\mathbb{R}^d$ and let $S$ and $S'$ be the
spectral measures of the exponent measure $\mu_*$ w.r.t. $\|\cdot\|_i$ ($i = 1, 2$) and $\|\cdot\|'_i$
($i = 1, 2$), respectively. The supports of $S$ and $S'$ are $\Xi = \{\omega \ge 0 : \|\omega\|_2 = 1\}$
and $\Xi' = \{\omega' \ge 0 : \|\omega'\|'_2 = 1\}$. Then for a real-valued, $S$-integrable function $f$
on $\Xi$, we have by (8.18) and (8.19)

\[ \int_\Xi f(\omega)\, S(d\omega) = \int_{\Xi'} f\Bigl( \frac{\omega'}{\|\omega'\|_2} \Bigr) \frac{\|\omega'\|_1}{\|\omega'\|'_1}\, S'(d\omega'). \]

In particular, for a Borel subset $B$ of $\Xi$, we have

\[ S(B) = \int_{\{\omega' \in \Xi' :\ \omega'/\|\omega'\|_2 \in B\}} \frac{\|\omega'\|_1}{\|\omega'\|'_1}\, S'(d\omega'). \tag{8.38} \]
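For a spectral measure consisting of finitely many atoms, the change-of-norms formula amounts to relocating each atom and rescaling its mass. The sketch below assumes, purely for illustration, that the primed norms are both the sum-norm and the unprimed norms are both the Euclidean norm; the function `change_norms` and the `(omega, mass)` representation are hypothetical.

```python
import math

def sum_norm(z):
    return sum(abs(x) for x in z)

def euclid_norm(z):
    return math.sqrt(sum(x * x for x in z))

def change_norms(atoms_prime, norm1, norm2, norm1_prime):
    """Apply the change-of-norms formula to a discrete S'.

    Each atom omega' moves to omega' / ||omega'||_2 and its mass is multiplied
    by ||omega'||_1 / ||omega'||'_1.
    """
    out = []
    for w, m in atoms_prime:
        n2 = norm2(w)
        out.append((tuple(x / n2 for x in w), m * norm1(w) / norm1_prime(w)))
    return out

# Sum-norm spectral measure H under complete dependence: mass 2 at (1/2, 1/2).
H_complete = [((0.5, 0.5), 2.0)]
S_complete = change_norms(H_complete, euclid_norm, euclid_norm, sum_norm)
(point, mass), = S_complete
angle = math.atan2(point[1], point[0])   # angular position theta of the atom
```

The result is a single atom of mass $\sqrt{2}$ at angle $\pi/4$, which agrees with the description of complete dependence under the Euclidean norm.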
Spectral functions


Alternatively, a simple multivariate distribution function $G_*$ is max-stable if and
only if there exist non-negative, integrable functions $f_1, \ldots, f_d$ on $[0, 1]$ satisfying
$\int_0^1 f_j(t)\,dt = 1$ such that

\[ -\log G_*(z) = \int_0^1 \bigvee_{j=1}^d \frac{f_j(t)}{z_j}\, dt, \qquad z \in [0, \infty] \tag{8.39} \]

(de Haan 1984). That this defines a distribution function follows from a special
point-process construction described in the same paper; see also section 9.2.1,
where this point-process construction will be used to motivate parametric models
for multivariate extreme value distributions. Clearly, $G_*$ has standard Fréchet
margins and is max-stable.

To show that a representation (8.39) is always possible, let $G_*$ be a simple
multivariate extreme value distribution with spectral measure $S$ for some equal
choice of the two norms. Let $Q(\cdot) = S(\cdot)/S(\Xi)$, which is a probability measure
on $\Xi$. By a multivariate extension of the probability integral transform, there exist
non-negative functions $g_1, \ldots, g_d$ on $[0, 1]$ such that, with $U$ uniformly distributed
on $(0, 1)$, the distribution of the random vector $(g_1(U), \ldots, g_d(U))$ is $Q$ (Skorohod
1956). Then for $z \in [0, \infty]$,

\[ -\log G_*(z) = S(\Xi) \int_\Xi \bigvee_{j=1}^d \frac{\omega_j}{z_j}\, Q(d\omega) = S(\Xi) \int_0^1 \bigvee_{j=1}^d \frac{g_j(t)}{z_j}\, dt. \]

Defining $f_j = S(\Xi) g_j$, we obtain (8.39).

In terms of the original max-stable distribution function $G$, we find, combining
(8.6) and (8.39),

\[ \log G(x) = \int_0^1 \bigwedge_{j=1}^d \{f_j(t) \log G_j(x_j)\}\, dt, \qquad x \in \mathbb{R}^d. \tag{8.40} \]

Observe that the spectral functions $f_j$ in (8.40) are not unique. In particular,
independence arises as soon as the supports of the spectral functions are disjoint,
while total dependence arises as soon as all spectral functions are equal.
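The two closing remarks about spectral functions can be illustrated numerically. Writing $v_j = 1/z_j$ in the spectral-function representation gives $l(v) = \int_0^1 \bigvee_j f_j(t) v_j\, dt$, which the sketch below approximates with a midpoint rule. The particular spectral functions are hypothetical examples chosen here, each integrating to one.

```python
def tail_dependence_from_spectral_functions(v, fs, n=10_000):
    """Midpoint approximation of l(v) = integral over [0, 1] of max_j f_j(t) * v_j."""
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n
        total += max(f(t) * vj for f, vj in zip(fs, v))
    return total / n

# Disjoint supports: f_1 lives on [0, 1/2), f_2 on [1/2, 1].
disjoint = [lambda t: 2.0 if t < 0.5 else 0.0,
            lambda t: 2.0 if t >= 0.5 else 0.0]
# Identical spectral functions.
equal = [lambda t: 1.0, lambda t: 1.0]

v = (3.0, 1.5)
l_disjoint = tail_dependence_from_spectral_functions(v, disjoint)  # v_1 + v_2
l_equal = tail_dependence_from_spectral_functions(v, equal)        # max(v_1, v_2)
```

With disjoint supports the maximum inside the integral never mixes the components, so the integral splits into a sum (independence); with equal functions the maximum factors out of the integral (complete dependence).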


8.2.4 Properties of max-stable distributions 

The fact that a max-stable distribution function $G$ is linked to its margins
$G_1, \ldots, G_d$ by means of a spectral measure $H$ as in (8.22) has large repercussions
on its dependence structure.

Positive association

A multivariate extreme value distribution $G$, in the terminology of Lehmann (1966),
is necessarily positively quadrant dependent, that is,

\[ G(x) \ge G_1(x_1) \cdots G_d(x_d), \qquad x \in \mathbb{R}^d, \tag{8.41} \]

a property originally noted in the bivariate case by Sibuya (1960), and in the general
case by Tiago de Oliveira (1962/1963). In particular, a random vector $Y$ with
distribution function $G$ has $\mathrm{cov}[f_i(Y_i), f_j(Y_j)] \ge 0$ for any $1 \le i, j \le d$ and any
pair of non-decreasing functions $f_i$ and $f_j$ such that the relevant expectations exist.
Relation (8.41) follows from (8.14) and the fact that $l(v) \le v_1 + \cdots + v_d$. Observe
also that (8.41) implies that $G(x) > 0$ as soon as $G_j(x_j) > 0$ for all $j = 1, \ldots, d$.

Multivariate extreme value distributions satisfy even stronger concepts of positive
dependence. Marshall and Olkin (1983) show that they are associated (Esary
et al. 1967) in the sense that $\mathrm{cov}[\xi(Y), \eta(Y)] \ge 0$ for every pair of non-decreasing
functions $\xi$ and $\eta$ on $\mathbb{R}^d$ for which the relevant expectations exist; see also Resnick
(1987) for an alternative proof. Bivariate extreme value distributions are shown to
satisfy a property called total positivity of order two by Kimeldorf and Sampson
(1987) and to be monotone regression dependent (Lehmann 1966) by Garralda
Guillem (2000).


Independence and complete dependence 


Next we turn to characterizations of the two extreme cases of independence and
complete dependence. We start with the case of independence. Takahashi (1994)
showed that $G(x) = G_1(x_1) \cdots G_d(x_d)$ for all $x \in \mathbb{R}^d$ if and only if there exists
$y \in \mathbb{R}^d$ with $0 < G_j(y_j) < 1$ for all $j = 1, \ldots, d$ such that $G(y) = G_1(y_1) \cdots G_d(y_d)$.

The ‘if’-part may be proved as follows. Denoting $v_j = -\log G_j(y_j)$, we must
have $\int_{S_d} \{\sum_{j=1}^d (\omega_j v_j) - \bigvee_{j=1}^d (\omega_j v_j)\}\, H(d\omega) = 0$. Since the integrand is
non-negative, the $H$-measure of the set where it is positive must be zero. But then,
since $0 < v_j < \infty$ for all $j = 1, \ldots, d$, the set
$\{\omega \in S_d : \exists\, 1 \le i < j \le d : \omega_i > 0,\ \omega_j > 0\}$ must have $H$-measure zero.
Consequently, $H$ is concentrated on the complement of the set above, which is equal
to $\{e_1, \ldots, e_d\}$. Restriction (8.26) forces $H(\{e_j\}) = 1$ for all $j$, which by (8.25)
implies independence.

A closely related characterization of independence, going back to Berman
(1961), is in terms of the bivariate margins $G_{ij}$ for $1 \le i < j \le d$, that is, the
bivariate distribution functions of the pairs $(Y_i, Y_j)$, where $Y$ is a random vector
with distribution function $G$. We have $G(x) = G_1(x_1) \cdots G_d(x_d)$ for all $x \in \mathbb{R}^d$
if and only if there exists $y \in \mathbb{R}^d$ with $0 < G_j(y_j) < 1$ for all $j = 1, \ldots, d$ such
that $G_{ij}(y_i, y_j) = G_i(y_i) G_j(y_j)$ for all $1 \le i < j \le d$. In words, pairwise
independence implies total independence. The proof is similar to that of the
characterization above.

On the other extreme is the case of complete dependence. Takahashi (1994)
noted that $G(x) = G_1(x_1) \wedge \cdots \wedge G_d(x_d)$ for all $x \in \mathbb{R}^d$ if and only if there exists
$y \in \mathbb{R}^d$ with $0 < G_1(y_1) = \cdots = G_d(y_d) < 1$ such that $G(y) = G_1(y_1)$. The ‘if’-part
can be easily proven as follows. Denoting $v = -\log G_1(y_1) \in (0, \infty)$, we
have by (8.26) and (8.25), for $i = 1, \ldots, d$,

\[ \int_{S_d} \bigvee_{j=1}^d (\omega_j v)\, H(d\omega) - v \int_{S_d} \omega_i\, H(d\omega) = 0. \]

Since the combined integrand on the left, $\bigvee_{j=1}^d (\omega_j v) - \omega_i v$, is non-negative,
the $H$-measure of the set $\{\omega \in S_d : \omega_1 \vee \cdots \vee \omega_d > \omega_i\}$ must be zero for all
$i = 1, \ldots, d$. Hence $H$ must be concentrated on the mid-point $(1/d, \ldots, 1/d)$.
Restriction (8.26) then forces $H(\{(1/d, \ldots, 1/d)\}) = d$, which by (8.25) implies
complete dependence.


Closure property 


Finally, we mention the following closure properties of the class of max-stable
distributions. If $G$ is a max-stable distribution function with spectral measure $S$
as in (8.22), then for all $0 < s < \infty$, the function $G^s$ is a max-stable distribution
function as well, its spectral measure being again $S$. More generally, if $G_1, \ldots, G_m$
are $d$-variate max-stable distribution functions such that for each $j = 1, \ldots, d$ the
marginal distribution functions $G_{ij}$ are the same for all $i = 1, \ldots, m$, then for all
non-negative $p_1, \ldots, p_m$ such that $p_1 + \cdots + p_m > 0$, the distribution function

\[ G = G_1^{p_1} \cdots G_m^{p_m} \tag{8.42} \]

is max-stable as well (Gumbel 1962), its spectral measure being
$S = w_1 S_1 + \cdots + w_m S_m$, where $w_i = p_i/(p_1 + \cdots + p_m)$ and where $S_i$ is the
spectral measure of $G_i$. In particular, any convex combination of stable tail
dependence functions is again a stable tail dependence function.

8.2.5 Bivariate case 

Let $G$ be a bivariate extreme value distribution function with margins $G_1$ and $G_2$.
Apart from the spectral measure $H$ or the stable tail dependence function $l$, various
alternative ways to describe the dependence structure of $G$ have been proposed in
the literature.


Pickands dependence function 

Quite popular is Pickands dependence function

\[ A(t) = l(1 - t, t), \qquad t \in [0, 1] \tag{8.43} \]

(Pickands 1981). Equation (8.43) is Pickands' original definition. Later authors,
including Pickands himself, sometimes define Pickands dependence function as
$l(t, 1 - t) = A(1 - t)$ for $t \in [0, 1]$. Pickands dependence function can be viewed
as the restriction of the stable tail dependence function to the unit simplex. In a
higher-dimensional setting, the restriction of the stable tail dependence function to
the unit simplex is sometimes called Pickands dependence function as well.

The function $A$ completely determines the stable tail dependence function $l$, as

\[ l(v_1, v_2) = (v_1 + v_2)\, A\Bigl( \frac{v_2}{v_1 + v_2} \Bigr) \tag{8.44} \]

for $0 \le v_j < \infty$ ($j = 1, 2$) such that $v_1 + v_2 > 0$. In particular, a bivariate
max-stable distribution $G$ is determined by its margins, $G_1$ and $G_2$, and its Pickands
dependence function, $A$, through

\[ G(y_1, y_2) = \exp\Bigl[ \log\{G_1(y_1) G_2(y_2)\}\, A\Bigl( \frac{\log\{G_2(y_2)\}}{\log\{G_1(y_1) G_2(y_2)\}} \Bigr) \Bigr] \tag{8.45} \]

for $(y_1, y_2) \in \mathbb{R}^2$.

By (L3) and (L4), a Pickands dependence function $A$ satisfies the following
two properties:

(A1) $(1 - t) \vee t \le A(t) \le 1$ for $t \in [0, 1]$;

(A2) $A$ is convex.

In (A1), the lower bound, $A(t) = (1 - t) \vee t$, corresponds to complete dependence,
$G(x_1, x_2) = G_1(x_1) \wedge G_2(x_2)$, whereas the upper bound, $A(t) = 1$, corresponds to
independence, $G(x_1, x_2) = G_1(x_1) G_2(x_2)$.
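Properties (A1) and (A2) and the reconstruction of $l$ from $A$ are easy to check numerically for a concrete model. The sketch below uses the symmetric logistic model, that is, the model of Example 8.1 with $\psi_1 = \psi_2 = 1$, for which $l(v_1, v_2) = (v_1^{1/\alpha} + v_2^{1/\alpha})^{\alpha}$; the function names and the value $\alpha = 0.6$ are illustrative assumptions.

```python
def A_logistic(t, alpha):
    """Pickands dependence function A(t) = l(1 - t, t) of the logistic model."""
    return ((1.0 - t) ** (1.0 / alpha) + t ** (1.0 / alpha)) ** alpha

def l_from_A(v1, v2, A):
    """Stable tail dependence function recovered from A: (v1 + v2) * A(v2 / (v1 + v2))."""
    s = v1 + v2
    return s * A(v2 / s)

alpha = 0.6
A = lambda t: A_logistic(t, alpha)
grid = [i / 100 for i in range(101)]

# (A1): max(t, 1 - t) <= A(t) <= 1 on a grid (small tolerance for rounding).
a1_ok = all(max(t, 1.0 - t) - 1e-12 <= A(t) <= 1.0 + 1e-12 for t in grid)
# (A2): midpoint convexity on the same grid.
a2_ok = all(A((s + t) / 2) <= (A(s) + A(t)) / 2 + 1e-12 for s in grid for t in grid)
# Reconstruction of l at an arbitrary point.
l_direct = (2.0 ** (1 / alpha) + 3.0 ** (1 / alpha)) ** alpha
l_via_A = l_from_A(2.0, 3.0, A)
```

A grid check of this kind is of course no proof of convexity, but it is a convenient way to catch a mis-specified candidate for $A$.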

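We can connect Pickands dependence function to the spectral measure $H$ of $\mu_*$
with respect to the sum-norm, see (8.28). Combining (8.30) with (8.43), we find

\[ A(t) = \int_{[0,1]} \{\omega(1 - t)\} \vee \{(1 - \omega) t\}\, H(d\omega) \tag{8.46} \]

\[ \phantom{A(t)} = t \int_{[0,t]} (1 - \omega)\, H(d\omega) + (1 - t) \int_{(t,1]} \omega\, H(d\omega). \]

Now by (8.29),

\[ \int_{(t,1]} \omega\, H(d\omega) = H((t, 1]) - \int_{(t,1]} (1 - \omega)\, H(d\omega)
   = \{2 - H([0, t])\} - \Bigl\{ 1 - \int_{[0,t]} (1 - \omega)\, H(d\omega) \Bigr\}
   = 1 - H([0, t]) + \int_{[0,t]} (1 - \omega)\, H(d\omega), \]

so that

\[ A(t) = \int_{[0,t]} (1 - \omega)\, H(d\omega) + (1 - t)\{1 - H([0, t])\} = 1 - t + \int_0^t H([0, \omega])\, d\omega. \]

Hence $H$ can be recovered from $A$ through

\[ H([0, t]) = 1 + A'(t), \quad 0 \le t < 1, \qquad H([0, 1]) = 2, \tag{8.47} \]

where $A'$ denotes the right-hand derivative of $A$. In particular, the point masses of
$H$ at the end-points are

\[ H(\{0\}) = 1 + A'(0), \qquad H(\{1\}) = 1 - A'(1), \tag{8.48} \]

with $A'(1) = \sup_{0 \le t < 1} A'(t)$. If $A'$ is absolutely continuous, then $H$ is absolutely
continuous on the interior of the unit interval with density $h = A''$. Incidentally,
equation (8.47) shows that any real-valued function $A$ defined on $[0, 1]$ that satisfies
the properties (A1)-(A2) is necessarily a Pickands dependence function, with the
spectral measure $H$ given by (8.47).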

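Pickands dependence function $A$ can also be linked to the spectral measure
$S$ for a general choice of the two norms $\|\cdot\|_i$ ($i = 1, 2$). Combining (8.23) and
(8.43),

\[ A(t) = \int_\Xi \Bigl\{ (1 - t) \frac{\omega_1}{\|(\omega_1, \omega_2)\|_1} \Bigr\} \vee \Bigl\{ t\, \frac{\omega_2}{\|(\omega_1, \omega_2)\|_1} \Bigr\}\, S(d(\omega_1, \omega_2)). \tag{8.49} \]

Retrieving $S$ from $A$ is more difficult. First, we need to find $H$ in terms of $A$
using (8.47), and second, we need to compute $S$ in terms of $H$ using (8.38), which
specializes to

\[ S(B) = \int_{[0,1]} 1\Bigl\{ \frac{(\omega, 1 - \omega)}{\|(\omega, 1 - \omega)\|_2} \in B \Bigr\}\, \|(\omega, 1 - \omega)\|_1\, H(d\omega) \]

for Borel subsets $B$ of $\Xi$.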





Huang’s level sets. In some sense dual to Pickands dependence function, $A$, are
the level sets of $l$,

\[ Q_c = \{(v_1, v_2) \in [0, \infty)^2 : l(v_1, v_2) = c\}, \qquad 0 < c < \infty, \]

first studied by Huang (1992); see also de Haan and de Ronde (1998). Clearly,

\[ Q_c = \{(r\omega, r(1 - \omega)) : 0 \le \omega \le 1,\ r = c/A(1 - \omega)\}, \qquad 0 < c < \infty. \]

The set $Q_c$ is the graph of a non-increasing, concave function through the points
$(0, c)$ and $(c, 0)$, and $Q_c = cQ_1$. From $Q_c$, we can reconstruct $A$ and hence
$l$. Independence occurs if $Q_c = \{(v_1, v_2) \in [0, \infty)^2 : v_1 + v_2 = c\}$, and complete
dependence occurs if $Q_c = \{(v_1, v_2) \in [0, \infty)^2 : v_1 \vee v_2 = c\}$.


Some history 

The oldest descriptions of bivariate extreme value distributions date back to Tiago
de Oliveira (1958), Sibuya (1960) and Geffroy (1958/59); see also the early
review by Gumbel (1962). Each of these authors introduced a different function
to characterize the dependence structure. However, the representation discovered
by Pickands (1981) turned out to be far more convenient than its predecessors
and reduced the popularity of the latter to virtually zero. Still, Obretenov (1991)
studied multivariate extensions of the dependence functions of Tiago de Oliveira
and Sibuya.

Tiago de Oliveira (1958, 1962) obtained the representation

\[ G(x_1, x_2) = \exp\Bigl[ \log\{G_1(x_1) G_2(x_2)\}\, k\Bigl( \log \frac{\log G_1(x_1)}{\log G_2(x_2)} \Bigr) \Bigr]. \]

The dependence function $k$ is related to Pickands dependence function $A$ by

\[ k(x) = A\Bigl( \frac{1}{e^x + 1} \Bigr), \qquad x \in \mathbb{R}. \tag{8.50} \]

Since $(1 + t) k(\log t) = l(t, 1)$ for $0 < t < \infty$, necessary and sufficient conditions
for a function $k$ on $\mathbb{R}$ to be a Tiago de Oliveira dependence function are (i)
$t \vee 1 \le (1 + t) k(\log t) \le t + 1$ and (ii) $(1 + t) k(\log t)$ is a convex function of $t$.

Next, Geffroy (1958/59) considered the representation

\[ G(x_1, x_2) = \exp\Bigl[ \Bigl\{ 1 + \varphi\Bigl( \frac{\log G_1(x_1)}{\log G_2(x_2)} \Bigr) \Bigr\} \log G_2(x_2) \Bigr]. \]

In terms of the stable tail dependence function $l$, we have

\[ \varphi(t) = l(t, 1) - 1, \qquad 0 < t < \infty. \]

Necessary and sufficient conditions for a function $\varphi$ on $(0, \infty)$ to be a Geffroy
dependence function are (i) $0 \vee (t - 1) \le \varphi(t) \le t$ and (ii) $\varphi$ is convex.

Finally, Sibuya (1960) introduced the representation

\[ G(x_1, x_2) = \exp\Bigl[ \chi\Bigl( \frac{\log G_2(x_2)}{\log G_1(x_1)} \Bigr) \log G_1(x_1) + \log\{G_1(x_1)\} + \log\{G_2(x_2)\} \Bigr]. \]

In terms of the stable tail dependence function, $l$, we have

\[ \chi(t) = l(1, t) - (1 + t), \qquad 0 < t < \infty. \]

Hence, necessary and sufficient conditions for a function $\chi$ on $(0, \infty)$ to be a
Sibuya dependence function are (i) $-(t \wedge 1) \le \chi(t) \le 0$ and (ii) $\chi$ is convex.

8.2.6 Other choices for the margins 

Reductions to margins other than standard Fréchet have been considered in the
literature as well, other popular distributions being the exponential, extreme value
Weibull, Gumbel, or uniform distribution. Although the choice of marginal
distribution essentially makes no difference, some properties or characterizations
are most naturally seen for one particular choice. Also, different choices
sometimes motivate different statistical methods.


Exponential margins 

One such choice, by Pickands (1981), is the standard exponential distribution,
which is a univariate extreme value distribution for minima rather than for maxima.
Let the random vector $Y$ have the extreme value distribution function $G$. Then
$(-\log G_1(Y_1), \ldots, -\log G_d(Y_d))$ has a multivariate extreme value distribution for
minima with standard exponential margins, $P[-\log G_j(Y_j) \le v] = 1 - e^{-v}$ for
$v \ge 0$. Its joint survivor function is given by

\[ P[-\log G_1(Y_1) > v_1, \ldots, -\log G_d(Y_d) > v_d] = \exp\{-l(v)\} \tag{8.51} \]

for $v \in [0, \infty]$. In the bivariate case,

\[ P[-\log G_1(Y_1) > v_1,\ -\log G_2(Y_2) > v_2] = \exp\Bigl\{ -(v_1 + v_2)\, A\Bigl( \frac{v_2}{v_1 + v_2} \Bigr) \Bigr\} \]

for $(v_1, v_2) \in [0, \infty]^2$.

Extreme value Weibull margins 

Rather than exponential margins, Falk et al. (1994) prefer extreme value Weibull
or reversed exponential margins. For $w \in [-\infty, 0]$, we have

\[ P[\log G_1(Y_1) \le w_1, \ldots, \log G_d(Y_d) \le w_d] = \exp\{-l(-w)\}, \]

with marginal distribution $P[\log G_j(Y_j) \le w] = e^w$ for $w \le 0$. In the bivariate
case,

\[ P[\log G_1(Y_1) \le w_1,\ \log G_2(Y_2) \le w_2] = \exp\Bigl\{ (w_1 + w_2)\, A\Bigl( \frac{w_2}{w_1 + w_2} \Bigr) \Bigr\} \]

for $(w_1, w_2) \in [-\infty, 0]^2$.

Gumbel margins 

In the early days of multivariate extreme value theory, it was customary to
standardize to Gumbel margins, probably under the influence of the classical
monograph by Gumbel (1958). Recall that the Gumbel distribution function
is defined by $\Lambda(x) = \exp(-e^{-x})$ for $x \in \mathbb{R}$. If $G$ is a multivariate extreme
value distribution function, then the distribution function of the random vector
$(-\log\{-\log G_1(Y_1)\}, \ldots, -\log\{-\log G_d(Y_d)\})$ is, with slight abuse of notation,
given by

\[ \Lambda(x) = \exp\{-l(e^{-x_1}, \ldots, e^{-x_d})\}, \qquad x \in \mathbb{R}^d, \]

which is a multivariate extreme value distribution function with Gumbel margins.
In the bivariate case, we find

\[ \Lambda(x_1, x_2) = \exp\{-(e^{-x_1} + e^{-x_2})\, k(x_2 - x_1)\}, \qquad (x_1, x_2) \in \mathbb{R}^2, \]

where $k$ is Tiago de Oliveira’s dependence function (8.50).

Uniform margins 

A popular way to describe the dependence structure of a multivariate distribution
function is through its copula. In general, for any multivariate distribution function
$F$ with margins $F_1, \ldots, F_d$, there exists a distribution function $C_F$ with uniform
margins on $(0, 1)$ such that

\[ F(x) = C_F\{F_1(x_1), \ldots, F_d(x_d)\}, \qquad x \in \mathbb{R}^d \]

(Sklar 1959). Such a $C_F$ is called a copula for $F$. If the margins, $F_j$, are continuous,
then the copula, $C_F$, is unique and is given by

\[ C_F(u) = F\{F_1^{\leftarrow}(u_1), \ldots, F_d^{\leftarrow}(u_d)\}, \qquad u \in [0, 1]^d, \]

so $C_F$ is the distribution function of the random vector $(F_1(X_1), \ldots, F_d(X_d))$,
where $X$ is a random vector with distribution function $F$. Here $F_j^{\leftarrow}$ denotes the
quantile function of $F_j$, defined by $F_j^{\leftarrow}(p) = \inf\{x \in \mathbb{R} : F_j(x) \ge p\}$.

The copula of a multivariate extreme value distribution $G$ is the distribution
function of the random vector $(G_1(Y_1), \ldots, G_d(Y_d))$ and is given by

\[ C_G(u) = \exp[-l\{-\log(u_1), \ldots, -\log(u_d)\}], \qquad u \in [0, 1]^d, \tag{8.52} \]

see (8.14). Such a copula necessarily satisfies the stability property

\[ C_G^s(u) = C_G(u_1^s, \ldots, u_d^s), \qquad u \in [0, 1]^d,\ 0 < s < \infty. \tag{8.53} \]

Conversely, any copula that satisfies (8.53) is the copula of a multivariate extreme
value distribution. A bivariate extreme value copula can be written in terms of
Pickands dependence function as

\[ C_G(u, v) = \exp\Bigl[ \log(uv)\, A\Bigl\{ \frac{\log(v)}{\log(uv)} \Bigr\} \Bigr], \qquad (u, v) \in [0, 1]^2, \tag{8.54} \]

see (8.44) and (8.52).
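The stability property of extreme value copulas can be verified numerically for a concrete Pickands dependence function. The sketch below again takes the symmetric logistic model (Example 8.1 with $\psi_1 = \psi_2 = 1$) as an illustrative choice and evaluates the bivariate copula $C_G(u, v) = \exp[\log(uv)\, A\{\log(v)/\log(uv)\}]$ on the open unit square; the function names are hypothetical.

```python
import math

def A_logistic(t, alpha):
    """Pickands dependence function of the symmetric logistic model."""
    return ((1.0 - t) ** (1.0 / alpha) + t ** (1.0 / alpha)) ** alpha

def ev_copula(u, v, A):
    """Bivariate extreme value copula written in terms of A, for 0 < u, v < 1."""
    s = math.log(u * v)
    return math.exp(s * A(math.log(v) / s))

A = lambda t: A_logistic(t, 0.6)
u, v = 0.3, 0.8
c = ev_copula(u, v, A)

# Stability: C(u, v)^s should equal C(u^s, v^s) for every s > 0.
stable = all(
    math.isclose(c ** s, ev_copula(u ** s, v ** s, A), rel_tol=1e-10)
    for s in (0.5, 2.0, 7.0)
)
# With A identically 1 the copula reduces to the independence copula u * v.
c_indep = ev_copula(u, v, lambda t: 1.0)
```

The value `c` also lies between the independence copula $uv$ and the comonotone copula $u \wedge v$, reflecting the bounds in (A1).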


8.2.7 Summary measures for extremal dependence 

The dependence structure of a max-stable distribution can be described in various
ways: the preceding paragraphs featured exponent measures, spectral measures, the
stable tail dependence function, Pickands dependence function, the copula, and so
on. These quantities are infinite-dimensional objects and therefore not always easy
to handle. A possible solution consists of choosing a finite-dimensional but
hopefully large enough sub-class of dependence structures, that is, restricting attention
to a parametric model (section 9.2). An alternative solution is to summarize the
main properties of the dependence structure in a number of well-chosen coefficients
that give a rough but representative picture of the full dependence structure.


Extremal coefficients 


Let $G$ be a max-stable distribution function with margins $G_1, \ldots, G_d$, spectral
measure $S$ with respect to two norms $\|\cdot\|_i$ ($i = 1, 2$) on $\mathbb{R}^d$, and stable tail dependence
function $l$. For a non-empty subset $V$ of $\{1, \ldots, d\}$, let $e_V$ be the $d$-dimensional
vector of which the $j$th coordinate is one or zero according to $j \in V$ or $j \notin V$.
For such $V$, the coefficients

\[ \theta_V = l(e_V) = \int_\Xi \bigvee_{j \in V} \frac{\omega_j}{\|\omega\|_1}\, S(d\omega) \tag{8.55} \]

satisfy

\[ P[Y_j \le G_j^{\leftarrow}(p),\ \forall j \in V] = p^{\theta_V}, \qquad 0 < p < 1, \]

where $Y$ is a random vector with distribution function $G$ (Coles 1993; Smith 1991).
In particular, stronger dependence corresponds to smaller extremal coefficients.
Clearly, $\theta_\emptyset = 0$ and $\theta_{\{j\}} = 1$ for all $j = 1, \ldots, d$, so that the only relevant
coefficients $\theta_V$ are those for which $V$ has at least two elements.

Hence, in the bivariate case, the only non-trivial coefficient is

\[ \theta = \theta_{\{1,2\}} = l(1, 1) = 2A(1/2), \tag{8.56} \]

where $A$ is Pickands dependence function (8.43). The coefficient $\theta$ must lie in the
interval $[1, 2]$, and satisfies

\[ P[Y_1 \le G_1^{\leftarrow}(p),\ Y_2 \le G_2^{\leftarrow}(p)] = p^{\theta}, \qquad 0 < p < 1. \]

In view of the conditions (A1)-(A2) on $A$, it is clear that $\theta$ strongly restricts the
shape of $A$. In particular, independence occurs if and only if $\theta = 2$, while complete
dependence occurs if and only if $\theta = 1$.
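The bivariate extremal coefficient is a single evaluation of Pickands dependence function, $\theta = 2A(1/2)$, so the three reference cases can be checked in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the logistic model with $\alpha = 0.5$ is an arbitrary choice of intermediate dependence, for which $\theta = 2^{\alpha}$.

```python
import math

def extremal_coefficient(A):
    """Bivariate extremal coefficient theta = l(1, 1) = 2 * A(1/2)."""
    return 2.0 * A(0.5)

def joint_prob_below_quantile(p, A):
    """P[Y1 <= G1^{<-}(p), Y2 <= G2^{<-}(p)] = exp{2 * log(p) * A(1/2)}.

    This is the Pickands representation of G evaluated where both margins equal p.
    """
    return math.exp(2.0 * math.log(p) * A(0.5))

A_indep = lambda t: 1.0                       # independence
A_compl = lambda t: max(t, 1.0 - t)           # complete dependence
alpha = 0.5
A_logis = lambda t: ((1.0 - t) ** (1 / alpha) + t ** (1 / alpha)) ** alpha

theta_indep = extremal_coefficient(A_indep)   # 2
theta_compl = extremal_coefficient(A_compl)   # 1
theta_logis = extremal_coefficient(A_logis)   # 2 ** alpha
p = 0.9
prob = joint_prob_below_quantile(p, A_logis)  # equals p ** theta_logis
```

Read this way, $\theta$ acts as an effective number of independent components: the joint probability behaves like that of $\theta$ independent variables each staying below its $p$-quantile.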

In the $d$-dimensional case, we have $1 \le \theta_V \le |V|$ for non-empty
$V \subset \{1, \ldots, d\}$, where $|V|$ denotes the number of elements of $V$. The upper and lower
bounds correspond to independence and complete dependence, respectively.
Conversely, $\theta_{\{1,\ldots,d\}} = d$ implies independence, whereas $\theta_{\{1,\ldots,d\}} = 1$ implies complete
dependence, as follows from the characterizations due to Takahashi (1994) given
earlier. Schlather and Tawn (2002, 2003) give necessary and sufficient conditions
on a collection of numbers $\theta_V$ indexed by the non-empty subsets $V$ of $\{1, \ldots, d\}$
to be the extremal coefficients of a multivariate extreme value distribution.

Other summary measures for bivariate dependence 

Two popular distribution-free measures of dependence between the components
of a bivariate random vector are Kendall’s tau and Spearman’s rho. Applied to a
bivariate max-stable distribution, they also serve as summaries of the
dependence structure.

Let $F$ be a bivariate distribution function, and let $(X_1, Y_1)$ and $(X_2, Y_2)$ be
independent random vectors with distribution function $F$. Kendall’s tau is defined by

\[ \tau = P[(X_1 - X_2)(Y_1 - Y_2) > 0] - P[(X_1 - X_2)(Y_1 - Y_2) < 0], \tag{8.57} \]

that is, the difference between the probabilities of concordance and discordance. If
the margins, $F_X$ and $F_Y$, of $F$ are continuous, and if $C_F$ is the (necessarily unique)
copula function of $F$, then $\tau$ is given by

\[ \tau = 4E[C_F(U, V)] - 1, \]

where $(U, V) = (F_X(X), F_Y(Y))$ has distribution function $C_F$ (Nelsen 1999). Next,
Spearman’s rho is defined as the Pearson correlation coefficient of $(U, V)$, that is,

\[ \rho_S = \mathrm{corr}(U, V) = 12E[UV] - 3. \]

Tiago de Oliveira (1980) already gave expressions for Kendall's tau and Spearman's
rho of a bivariate extreme value distribution $G$ in terms of his dependence
function $k$, see (8.50). These expressions were rediscovered later, but then in
terms of Pickands dependence function $A$, see (8.54). Let $A'(t)$ be the right-hand
derivative of $A$ in $t \in [0, 1)$; also let $A'(1) = \sup_{0 \le t < 1} A'(t)$. Then Kendall's tau
is given by

$$\tau = \int_0^1 \frac{t(1 - t)}{A(t)}\, dA'(t)$$

(Ghoudi et al. 1998; Hürlimann 2003), and Spearman's rho by


$$\rho_S = \mathrm{corr}[G_1(Y_1), G_2(Y_2)] = 12 \int_0^1 \frac{1}{\{1 + A(t)\}^2}\, dt - 3$$

(Hürlimann 2003); see also the unpublished 1995 Université Laval doctoral dissertation
by A. Khoudraji. For both $\tau$ and $\rho_S$, the extreme cases 0 and 1 correspond
to independence and complete dependence, respectively. Hürlimann (2003) also
shows that for bivariate extreme value copulas,

$$-1 + (1 + 3\tau)^{1/2} \le \rho_S \le \min\left(\tfrac{3}{2}\tau,\ 2\tau - \tau^2\right),$$

thereby proving a special case of a conjecture of Hutchinson and Lai (1990).

To conclude, dependence measures for bivariate extreme value distributions
can also be obtained by studying the correlation of the two components of the
random vector for a particular choice of marginal distributions. In all cases, they
can be expressed in terms of Pickands dependence function, $A$. First, the reduction
to uniform margins on $(0, 1)$ leads to Spearman's rho, $\rho_S$. Next, choosing Gumbel
margins, Tiago de Oliveira (1980) obtains

$$\mathrm{corr}[-\log\{-\log G_1(Y_1)\},\ -\log\{-\log G_2(Y_2)\}] = -\frac{6}{\pi^2} \int_0^1 \frac{\log A(t)}{t(1 - t)}\, dt. \tag{8.58}$$

Finally, Tawn (1988a), choosing standard exponential margins, mentions

$$\mathrm{corr}[-\log G_1(Y_1),\ -\log G_2(Y_2)] = \int_0^1 \frac{1}{A(t)^2}\, dt - 1. \tag{8.59}$$

For all correlation coefficients, the two extreme cases 0 and 1 correspond to
independence and complete dependence, respectively.
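
The correlation-type summaries above are all one-dimensional integrals of the Pickands dependence function, so they are easy to evaluate numerically. The following sketch assumes the integral forms $\rho_S = 12\int_0^1 \{1 + A(t)\}^{-2} dt - 3$, $-(6/\pi^2)\int_0^1 \log A(t)/\{t(1-t)\}\, dt$ for Gumbel margins, and $\int_0^1 A(t)^{-2} dt - 1$ for exponential margins, and uses a plain midpoint rule; helper names are ours.

```python
import math


def midpoint_integral(f, a, b, n=20000):
    """Midpoint rule on n equal cells; the end-points are never evaluated."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def spearman_rho(A):
    """Spearman's rho: 12 * int_0^1 {1 + A(t)}^(-2) dt - 3."""
    return 12.0 * midpoint_integral(lambda t: 1.0 / (1.0 + A(t)) ** 2, 0.0, 1.0) - 3.0


def gumbel_margin_corr(A):
    """Correlation under Gumbel margins, cf. (8.58)."""
    integral = midpoint_integral(lambda t: math.log(A(t)) / (t * (1.0 - t)), 0.0, 1.0)
    return -(6.0 / math.pi ** 2) * integral


def exponential_margin_corr(A):
    """Correlation under standard exponential margins, cf. (8.59)."""
    return midpoint_integral(lambda t: 1.0 / A(t) ** 2, 0.0, 1.0) - 1.0


def independence(t):
    return 1.0


def complete_dependence(t):
    return max(t, 1.0 - t)


for name, A in [("independence", independence), ("complete dependence", complete_dependence)]:
    print(name, spearman_rho(A), gumbel_margin_corr(A), exponential_margin_corr(A))
```

Up to discretization error, all three measures return 0 for $A \equiv 1$ and 1 for $A(t) = t \vee (1 - t)$, as stated in the text.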


8.3 The Domain of Attraction 

Consider again the domain-of-attraction equation

$$\lim_{n \to \infty} F^n(a_n x + b_n) = G(x), \qquad x \in [-\infty, \infty], \tag{8.60}$$

where $a_n \in (0, \infty)$ and $b_n \in \mathbb{R}^d$. In section 8.2, we have focused on the right-hand
side of this equation, that is, on the class of multivariate extreme value distributions.
In this section, then, we will consider the left-hand side of equation (8.60). More
precisely, we will formulate a range of equivalent descriptions of the domain
of attraction, $D(G)$, of an extreme value distribution function $G$. Of particular
interest will be the connection between the dependence structure at extreme levels
of a distribution function $F$ in $D(G)$ and the various equivalent descriptions of the
dependence structure of $G$. The reinforcement of (8.60) to density convergence is
briefly mentioned in section 8.4.

The domain-of-attraction conditions form the groundwork for the statistical
threshold methods in section 9.4. The conditions are always phrased as limit relations,
which, taken as approximate equalities, generate approximations of $F$ over
certain regions of its support in terms of $G$. These approximations then serve as a
tool to devise statistical models and corresponding inference methods. For proper
understanding, we will denote the approximations by the symbol $\approx$, not to
be confused, by the way, with the symbol $\sim$, which has the precise meaning
$a(t) \sim b(t)$ if and only if $a(t)/b(t) \to 1$ as $t$ tends to its limit value, typically 0
or $\infty$.


8.3.1 General conditions 

The domain-of-attraction condition as stated in (8.60) is not very convenient to 
work with. In itself, it does not tell us much about the distribution of a random 
vector X with distribution function F given that X is in some sense extreme. To 
obtain that kind of information, we have to manipulate (8.60) carefully. 


Tail function 

Writing $F^n = [1 - n^{-1}\{n(1 - F)\}]^n$ and using the fact that $(1 - n^{-1} x_n)^n \to e^{-x} \in
[0, 1]$ if and only if $x_n \to x \in [0, \infty]$ as $n \to \infty$, we find that (8.60) holds if and
only if

$$\lim_{n \to \infty} n\{1 - F(a_n x + b_n)\} = -\log G(x), \qquad x \in [-\infty, \infty], \tag{8.61}$$

with the usual convention $-\log(0) = \infty$. By max-stability (8.2), we may rewrite
the previous equation as

$$1 - F(a_n x + b_n) \sim -\log G(\alpha_n x + \beta_n) \sim 1 - G(\alpha_n x + \beta_n), \qquad n \to \infty, \tag{8.62}$$

for $x$ such that $0 < G(x) < 1$.
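
To see the tail-function criterion (8.61) at work, consider a deliberately simple case that is our own choice and not taken from the text: $F$ has independent unit exponential margins, for which $a_n = 1$ and $b_n = \log n$ in every coordinate and the limit $G$ has independent Gumbel margins, so that $-\log G(x) = \sum_j e^{-x_j}$. The sketch compares both sides of (8.61) for increasing $n$.

```python
import math


def tail_limit_lhs(x, n):
    """n * {1 - F(a_n x + b_n)} for independent unit exponential margins,
    with a_n = 1 and b_n = log(n) in every coordinate (assumed example)."""
    # log F(x + log n) = sum_j log(1 - exp(-x_j) / n), computed stably.
    log_F = sum(math.log1p(-math.exp(-xj) / n) for xj in x)
    return -n * math.expm1(log_F)


def neg_log_G(x):
    """-log G(x) for the limit with independent Gumbel margins."""
    return sum(math.exp(-xj) for xj in x)


x = (0.5, -1.0)
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, tail_limit_lhs(x, n), neg_log_G(x))
```

In this example the difference between the two sides is exactly $e^{-x_1 - x_2}/n$, so the convergence in (8.61) is visible already for moderate $n$.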

Relation (8.62) may be used as a starting point for statistical inference on
$F(x)$ in $x$-regions for which each $F_j(x_j)$ is sufficiently close to one. Let $u$ be
such that $F_j(u_j)$ is close to one for every $j = 1, \ldots, d$. Equation (8.62) suggests
the approximation

$$F(x) \approx G(\alpha_n a_n^{-1} x - \alpha_n a_n^{-1} b_n + \beta_n) =: \tilde{G}(x), \qquad x \ge u.$$

Since $\tilde{G}$ and $G$ differ only in scale and location, $\tilde{G}$ is an extreme value distribution
as well, with the same stable tail dependence function, $l$, and the same extreme
value indices, $\gamma_j$. Hence

$$F(x) \approx \exp\{-l(v)\}, \qquad x \ge u, \tag{8.63}$$

where, for $j = 1, \ldots, d$ and $x_j \ge u_j$,

$$v_j = -\log \tilde{G}_j(x_j) = \left(1 + \gamma_j \frac{x_j - \tilde{\mu}_j}{\tilde{\sigma}_j}\right)_+^{-1/\gamma_j} = \lambda_j \left(1 + \gamma_j \frac{x_j - u_j}{\sigma_j}\right)^{-1/\gamma_j}, \tag{8.64}$$

with $\lambda_j = -\log \tilde{G}_j(u_j)$ and $\sigma_j = \tilde{\sigma}_j + \gamma_j (u_j - \tilde{\mu}_j)$. Together, equations (8.63)
and (8.64) form a semi-parametric model for $F$ in the region $[u, \infty)$. If we also
assume a parametric model for $l$, we end up with a fully parametric model. This
is the basis for the so-called censored-likelihood approach of Ledford and Tawn
(1996), see section 9.4.2.
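
The model formed by (8.63) and (8.64) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the censored-likelihood procedure itself: the marginal transform is taken as $v_j = \lambda_j \{1 + \gamma_j (x_j - u_j)/\sigma_j\}^{-1/\gamma_j}$, with the usual exponential limit for $\gamma_j = 0$, and the stable tail dependence function is assumed to be logistic, $l(v) = (\sum_j v_j^{1/\alpha})^{\alpha}$. All parameter values are made up.

```python
import math


def marginal_v(x, u, lam, sigma, gamma):
    """Marginal transform v_j of (8.64) for x >= u."""
    if x < u:
        raise ValueError("the model is only specified for x >= u")
    y = (x - u) / sigma
    if gamma == 0.0:
        return lam * math.exp(-y)      # limiting form as gamma -> 0 (our convention)
    base = 1.0 + gamma * y
    if base <= 0.0:
        return 0.0                     # beyond the upper end-point when gamma < 0
    return lam * base ** (-1.0 / gamma)


def logistic_l(v, alpha):
    """Assumed parametric model for l: the logistic stable tail dependence function."""
    return sum(vj ** (1.0 / alpha) for vj in v) ** alpha


def approx_F(x, u, lam, sigma, gamma, alpha):
    """Approximation F(x) ~ exp{-l(v)} of (8.63) on the region x >= u."""
    v = [marginal_v(xj, uj, lj, sj, gj)
         for xj, uj, lj, sj, gj in zip(x, u, lam, sigma, gamma)]
    return math.exp(-logistic_l(v, alpha))


# Hypothetical thresholds and marginal parameters, for illustration only.
u = (10.0, 20.0)
lam = (0.05, 0.02)
sigma = (2.0, 3.0)
gamma = (0.2, 0.0)

F_at_u = approx_F(u, u, lam, sigma, gamma, alpha=0.5)
F_above = approx_F((14.0, 26.0), u, lam, sigma, gamma, alpha=0.5)
print(F_at_u, F_above)
```

At the threshold itself the transform reduces to $v_j = \lambda_j$, so the fitted $F(u)$ equals $\exp\{-l(\lambda)\}$; fixing $\alpha$ (or any other parameter of $l$) is what turns the semi-parametric model into a fully parametric one.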

Multivariate-threshold exceedances 

As in the univariate case, the domain-of-attraction condition (8.60) can be cast
in terms of exceedances over a high threshold. The event $\{X \not\le b_n\}$ is called an
exceedance over the (multivariate) threshold $b_n$. It entails that there is at least one
coordinate variable $X_j$ that exceeds the corresponding threshold $b_{n,j}$, although the
precise coordinate where this happens remains unspecified. Conditionally on the
exceedance $X \not\le b_n$, the vector $a_n^{-1}(X - b_n)$ is the vector of (scaled) excesses;
observe that some coordinates of the excess vector may be negative, although
under the conditioning event, at least one coordinate must be positive.

We are interested in the asymptotic distribution of the excess vector $a_n^{-1}(X - b_n)$
conditionally on $X \not\le b_n$. Without loss of generality, assume that $0 < G(0) < 1$.
For $x$ such that $G(x) > 0$, we obtain after some calculation that

$$P[a_n^{-1}(X - b_n) \le x \mid X \not\le b_n] \to \frac{1}{-\log G(0)} \log\left\{\frac{G(x)}{G(x \wedge 0)}\right\}, \qquad n \to \infty.$$

Now let $q = (q_1, \ldots, q_d)$ with $q_j$ the lower end-point of $G_j$, the $j$th margin of
$G$. Then the limit relation above implies

$$P[a_n^{-1}(X - b_n) \vee q \in \cdot \mid X \not\le b_n] \xrightarrow{d} P[W \in \cdot\,], \qquad n \to \infty, \tag{8.65}$$

where $W$ is a random vector with distribution function

$$P[W \le x] = \frac{1}{-\log G(0)} \log\left\{\frac{G(x)}{G(x \wedge 0)}\right\}. \tag{8.66}$$

Observe that $W_j \ge q_j$ and $\bigvee_{j=1}^d W_j > 0$ with probability one, although $P[W_j \le
0] = \lim_{n \to \infty} P[X_j \le b_{n,j} \mid X \not\le b_n]$ may be positive. In view of (8.65), we may
call the distribution of $W$ a multivariate Generalized Pareto (GP) distribution. To
our knowledge, it has not been studied before. It seems likely that (8.65) and (8.66)
can form the basis of new statistical procedures modelling multivariate-threshold
excesses.
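
A minimal sketch of the multivariate GP distribution function may help fix ideas. We assume the form $P[W \le x] = \log\{G(x)/G(x \wedge 0)\} / \{-\log G(0)\}$ and, purely for illustration, take $G$ bivariate with standard Gumbel margins and a logistic stable tail dependence function, so that $0 < G(0) < 1$ holds automatically and the lower end-points are $q_j = -\infty$. Function names are ours.

```python
import math


def logistic_l(v1, v2, alpha):
    """Assumed logistic stable tail dependence function."""
    return (v1 ** (1.0 / alpha) + v2 ** (1.0 / alpha)) ** alpha


def neg_log_G(x1, x2, alpha):
    """-log G(x) for standard Gumbel margins: l(exp(-x1), exp(-x2))."""
    return logistic_l(math.exp(-x1), math.exp(-x2), alpha)


def gp_cdf(x1, x2, alpha):
    """P[W <= x] = log{G(x) / G(x ^ 0)} / {-log G(0)}, cf. (8.66)."""
    capped = neg_log_G(min(x1, 0.0), min(x2, 0.0), alpha)
    return (capped - neg_log_G(x1, x2, alpha)) / neg_log_G(0.0, 0.0, alpha)


alpha = 0.5
# No mass on {W <= 0}: at least one coordinate of W is positive.
print(gp_cdf(0.0, 0.0, alpha))
# Marginal probability that the first coordinate does not exceed its threshold.
p0 = gp_cdf(0.0, math.inf, alpha)
# Conditional law of W_1 given W_1 > 0; exponential here because gamma_1 = 0.
px = gp_cdf(1.3, math.inf, alpha)
print((px - p0) / (1.0 - p0), 1.0 - math.exp(-1.3))
```

The last line illustrates the remark that a single coordinate, given that it exceeds its threshold, follows a univariate GP law, while $P[W_1 \le 0]$ is strictly positive unless the components are completely dependent.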

From (8.65), we can also derive the asymptotic distribution of the excess vector
given that there is a threshold exceedance in a specific coordinate: for $j = 1, \ldots, d$,
we have

$$P[a_n^{-1}(X - b_n) \vee q \in \cdot \mid X_j > b_{n,j}] \xrightarrow{d} P[W \in \cdot \mid W_j > 0] \tag{8.67}$$

as $n \to \infty$. Observe that the distribution of $W_j$ given $W_j > 0$ is univariate GP.

A closely related definition of multivariate GP distributions appears in Tajvidi
(1996). For a multivariate-threshold exceedance $\{X \not\le b_n\}$, he suggested setting
every coordinate of the excess vector where the threshold is not exceeded equal to
zero. In the notation of (8.65), this gives

$$P[a_n^{-1}(X - b_n) \vee 0 \in \cdot \mid X \not\le b_n] \xrightarrow{d} P[W \vee 0 \in \cdot\,], \qquad n \to \infty, \tag{8.68}$$

the distribution of the limiting random vector $W \vee 0$ being given by

$$P[W \vee 0 \le x] = \frac{1}{-\log G(0)} \log\left\{\frac{G(x)}{G(0)}\right\}, \qquad x \ge 0.$$

Observe that in two dimensions or more, the margins of this vector can be zero
with positive probability.

In the bivariate case, yet another definition of multivariate GP distributions is
proposed by Kaufmann and Reiss (1995): for a bivariate extreme value distribution
function $G$ with stable tail dependence function $l$ as in (8.14), define

$$H(x_1, x_2) = \{1 + \log G(x_1, x_2)\}_+ = [1 - l\{-\log G_1(x_1), -\log G_2(x_2)\}]_+.$$

This $H$ is a bivariate distribution function with translated GP margins $H_i(x_i) =
\{1 + \log G_i(x_i)\}_+$ and copula $C_H(u_1, u_2) = \{1 - l(1 - u_1, 1 - u_2)\}_+$. In three or
more dimensions, however, the formula $H(x) = \{1 + \log G(x)\}_+$ does not, in general,
lead to a valid distribution function, a counter-example being the case where
the margins of $G$ are independent.

Equal margins 


In case all margins of $F$ are equal to, say, $F_1$, the previous reformulations
of the domain-of-attraction condition (8.60) can be simplified somewhat. Let
$x_* = \sup\{x \in \mathbb{R} : F_1(x) < 1\}$ be the right end-point of $F_1$. By Pickands (1975),
$F_1 \in D(G_1)$ for some univariate extreme value distribution function $G_1$ with
$0 < G_1(0) < 1$ if and only if there exists a positive function $\sigma(u)$ defined on
$u < x_*$ such that $[1 - F_1\{u + \sigma(u)x\}]/\{1 - F_1(u)\} \to -\log G_1(x)$ as $u \uparrow x_*$, see
Chapter 2. In that case, the normalizing constants may be taken equal to $a_n = \sigma(b_n)$
and $b_n = F_1^{\leftarrow}(1 - 1/n)$.

Now let $G$ be a $d$-variate extreme value distribution function with all margins
equal to $G_1$. Denote the lower end-point of $G_1$ by $q$; observe that $q < 0$ because of
our assumption $G(0) > 0$. Using monotonicity and continuity, we can then show
that (8.61) and hence (8.60) is equivalent to

$$\frac{1 - F\{u + \sigma(u)x_1, \ldots, u + \sigma(u)x_d\}}{1 - F_1(u)} \to -\log G(x), \qquad u \uparrow x_*, \tag{8.69}$$

for all $x$ such that $x_j > q$ for all $j = 1, \ldots, d$. The latter criterion is a reformulation
of results obtained by Marshall and Olkin (1983), who considered the more general
case that the extreme value indices of the margins of $F$ all have the same sign.
For absolutely continuous distributions, Yun (1997) gives sufficient conditions for
(8.69) in terms of convergence of certain conditional densities. We will come back
to this in section 10.4 when studying the extremes of Markov chains.

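When the margins are equal, criteria involving exceedances over multivariate
thresholds get simpler as well. With $W$ as in (8.66), equation (8.65) is equivalent to

$$P\left[\frac{X - u}{\sigma(u)} \vee q \in \cdot \;\middle|\; \bigvee_{i=1}^d X_i > u\right] \xrightarrow{d} P[W \in \cdot\,], \qquad u \uparrow x_*,$$

and equation (8.67) to

$$P\left[\frac{X - u}{\sigma(u)} \vee q \in \cdot \;\middle|\; X_j > u\right] \xrightarrow{d} P[W \in \cdot \mid W_j > 0], \qquad u \uparrow x_*,$$

for all $j = 1, \ldots, d$. The latter formulation is used in Segers (2003a) to study the
extremes of univariate stationary time series.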

Exponent measure 

Condition (8.61) has an interesting interpretation in terms of exponent measures.
Recall from (8.3) that $G$ has an exponent measure, $\mu$, concentrated on $[q, \infty) \setminus \{q\}$,
given by $\mu([q, \infty) \setminus [q, x]) = -\log G(x)$ for $x > q$, where $q_j$ is the lower end-point
of $G_j$, the $j$th margin of $G$. Observe that, by (8.41), $-\log G(x)$ is finite
if $x > q$ and infinite otherwise, so that $\mu(B)$ is finite for Borel sets $B$ of $[q, \infty)$
bounded away from $q$. Also, define the measures $\mu_n$ on $[q, \infty) \setminus \{q\}$ by

$$\mu_n(\cdot) = nP[X_{1,n} \in \cdot\,], \qquad \text{where } X_{i,n} = a_n^{-1}(X_i - b_n) \vee q. \tag{8.70}$$

Since $\mu_n([q, \infty) \setminus [q, x]) = n\{1 - F(a_n x + b_n)\}$ for $x \in [q, \infty]$, equation (8.61)
may now be written in terms of the measures $\mu_n$ and $\mu$ as $\mu_n(B) \to \mu(B)$ as
$n \to \infty$ for every set $B = [q, \infty) \setminus [q, x]$ with $x > q$. Since both $\mu_n$ and $\mu$ put
zero mass on $[q, \infty] \setminus [q, \infty)$, a measure-theoretic argument now yields that (8.61)
and hence (8.60) is equivalent to

$$\mu_n \text{ converges vaguely to } \mu, \text{ notation } \mu_n \xrightarrow{v} \mu, \text{ on } [q, \infty] \setminus \{q\}, \tag{8.71}$$

to be interpreted as $\mu_n(B) \to \mu(B)$ as $n \to \infty$ for every Borel set $B$ in $[q, \infty] \setminus
\{q\}$ with compact closure and such that $\mu(\partial B) = 0$, where $\partial B$ denotes the topological
boundary of $B$. Observe that $B \subset [q, \infty] \setminus \{q\}$ has compact closure if and
only if there exists $x > q$ such that $B \subset [q, \infty] \setminus [q, x]$. For more information on
vague convergence of measures, see Resnick (1987) or Kallenberg (1983).


Point processes 

Consider the following point processes on $[0, \infty) \times [q, \infty)$:

$$N_n(\cdot) = \sum_{i=1}^{\infty} 1\{(i/n, X_{i,n}) \in \cdot\},$$

with $X_{i,n}$ as in (8.70). See section 5.9.2 for a short introduction on point processes.
Recall from (8.71) that the domain-of-attraction condition (8.60) is equivalent to
$\mu_n \xrightarrow{v} \mu$. By Proposition 3.21 of Resnick (1987), this is in turn equivalent to

$$N_n \xrightarrow{d} \text{Poisson process with mean measure } dt \times \mu(dx). \tag{8.72}$$

A particular consequence, useful for statistical inference, is the following convergence
of point processes on $[q, \infty)$:

$$\sum_{i=1}^{n} 1(X_{i,n} \in \cdot) \xrightarrow{d} \text{Poisson process with mean measure } \mu. \tag{8.73}$$


Discrete versus continuous index 


In the previous equations, the integer variable $n$ can be replaced by a continuous
variable $t$ tending to infinity. For instance, with $\lfloor t \rfloor$ denoting the integer part of
the real number $t$, equations (8.60), (8.61) and (8.71) can be extended to

$$\lim_{t \to \infty} F^t(a_{\lfloor t \rfloor} x + b_{\lfloor t \rfloor}) = G(x), \tag{8.74}$$

$$\lim_{t \to \infty} t\{1 - F(a_{\lfloor t \rfloor} x + b_{\lfloor t \rfloor})\} = -\log G(x), \tag{8.75}$$

$$\mu_t(\cdot) = tP[X_{1,\lfloor t \rfloor} \in \cdot\,] \xrightarrow{v} \mu(\cdot), \qquad t \to \infty, \tag{8.76}$$

the argument being that $\lfloor t \rfloor / t \to 1$ as $t \to \infty$.


8.3.2 Convergence of the dependence structure

When studying multivariate extremes, it is often convenient to separate the marginal
distributions from the dependence structure. The fact that we are allowed to do
so follows from the property that weak convergence of multivariate distribution
functions is equivalent to weak convergence of (i) the marginal distribution functions
and (ii) the copula functions, provided the margins of the limit distribution
are continuous; see, for example, Deheuvels (1984).

So let $F_1, \ldots, F_d$ be the margins of $F$ and assume that for every $j = 1, \ldots, d$
there exist real sequences $(a_{n,j})_n$ and $(b_{n,j})_n$ with $a_{n,j} > 0$ and an extreme value
distribution function $G_j$ such that

$$F_j^n(a_{n,j} x_j + b_{n,j}) \xrightarrow{d} G_j(x_j), \qquad n \to \infty. \tag{8.77}$$

Then which extra condition is needed on $F$ in order to have (8.60) for some
multivariate extreme value distribution function $G$ with margins $G_1, \ldots, G_d$? By
the property in the previous paragraph, what is needed is convergence of the
dependence structure, to be specified next.

For convenience, we will assume that all the margins $F_j$ are continuous. This
has the particular advantage that each $F_j(X_j)$ is uniformly distributed on $(0, 1)$;
here $(X_1, \ldots, X_d)$ denotes a random vector with distribution function $F$. Also,
for $0 < u_j < 1$, the four events $\{X_j < F_j^{\leftarrow}(u_j)\}$, $\{X_j \le F_j^{\leftarrow}(u_j)\}$, $\{F_j(X_j) < u_j\}$,
and $\{F_j(X_j) \le u_j\}$ only differ on an event of probability zero and hence can be
interchanged freely. Finally, the copula of $F$ is unique and given by

$$C_F(u) = F\{F_1^{\leftarrow}(u_1), \ldots, F_d^{\leftarrow}(u_d)\}, \qquad u \in [0, 1]^d, \tag{8.78}$$

that is, $C_F$ is the distribution function of $(F_1(X_1), \ldots, F_d(X_d))$.

Copula convergence 

Let $X_1, X_2, \ldots$ be a sequence of independent random vectors with distribution
function $F$. The copula of $F^n$, the distribution function of the sample maximum,
$M_n = X_1 \vee \cdots \vee X_n$, is

$$\begin{aligned} C_{F^n}(u) &= F^n\{(F_1^n)^{\leftarrow}(u_1), \ldots, (F_d^n)^{\leftarrow}(u_d)\} \\ &= F^n\{F_1^{\leftarrow}(u_1^{1/n}), \ldots, F_d^{\leftarrow}(u_d^{1/n})\} = C_F^n(u_1^{1/n}, \ldots, u_d^{1/n}), \end{aligned}$$

for $u \in [0, 1]^d$.

Now let $G$ be an extreme value distribution function with margins $G_j$ and
copula $C_G$. We obtain that $F \in D(G)$ if and only if (8.77) holds together with

$$\lim_{n \to \infty} C_F^n(u_1^{1/n}, \ldots, u_d^{1/n}) = C_G(u), \qquad u \in [0, 1]^d. \tag{8.79}$$

Since the limit copula, $C_G$, is continuous, the above convergence holds uniformly
in $u \in [0, 1]^d$. Hence, in (8.79), we can replace the discrete variable $n$ by the
continuous variable $t$:

$$\lim_{t \to \infty} C_F^t(u_1^{1/t}, \ldots, u_d^{1/t}) = C_G(u), \qquad u \in [0, 1]^d. \tag{8.80}$$

By the stability relation (8.53) of $C_G$, we obtain from (8.80) the approximation
$C_F(u) \approx C_G(u)$ for $u$ such that all $u_j$ are sufficiently close to one. Writing $C_G$ in
terms of the stable tail dependence function $l$ as in (8.52) and substituting $F_j(x_j)$
for $u_j$ yields the approximation

$$F(x) \approx \exp[-l\{-\log F_1(x_1), \ldots, -\log F_d(x_d)\}], \tag{8.81}$$

for $x$ such that all $F_j(x_j)$ are close to unity.
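
Copula convergence as in (8.79) can be checked numerically for a copula that is not itself max-stable. As an assumed example (not taken from the text), the sketch uses the bivariate Joe copula $C(u, v) = 1 - \{(1-u)^{\theta} + (1-v)^{\theta} - (1-u)^{\theta}(1-v)^{\theta}\}^{1/\theta}$, whose extreme value limit is the copula $\exp[-l(-\log u, -\log v)]$ with $l(v_1, v_2) = (v_1^{\theta} + v_2^{\theta})^{1/\theta}$.

```python
import math


def joe_copula(u, v, theta):
    """Joe copula (assumed example of a copula in a domain of attraction)."""
    a, b = (1.0 - u) ** theta, (1.0 - v) ** theta
    return 1.0 - (a + b - a * b) ** (1.0 / theta)


def limit_copula(u, v, theta):
    """Extreme value copula exp[-l(-log u, -log v)], l(v1, v2) = (v1^theta + v2^theta)^(1/theta)."""
    x, y = -math.log(u), -math.log(v)
    return math.exp(-((x ** theta + y ** theta) ** (1.0 / theta)))


def max_copula(u, v, theta, n):
    """Copula of the componentwise maximum of n independent pairs: C^n(u^(1/n), v^(1/n))."""
    return joe_copula(u ** (1.0 / n), v ** (1.0 / n), theta) ** n


u, v, theta = 0.3, 0.6, 2.0
for n in (1, 10, 100, 10 ** 4):
    print(n, max_copula(u, v, theta, n), limit_copula(u, v, theta))
```

The printed values approach the limit as $n$ grows; for $n = 1$ the two copulas differ visibly, which is the sense in which the Joe copula lies in the domain of attraction of the limit without being equal to it.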


Reduction to standard Frechet or standard Pareto margins 

Alternatively, we can transform the random vector $X$ in such a way that its margins
become standard Fréchet: define the random vector $X_*$ with distribution function
$F_*$ by

$$\begin{aligned} X_{*j} &= -1/\log F_j(X_j), \qquad j = 1, \ldots, d, \\ F_*(z) &= F\{F_1^{\leftarrow}(e^{-1/z_1}), \ldots, F_d^{\leftarrow}(e^{-1/z_d})\}, \end{aligned} \tag{8.82}$$

where $0 < z_j < \infty$ for $j = 1, \ldots, d$. Conversely, $F$ can be obtained from $F_*$ and
its margins $F_j$ through $F(x) = F_*\{-1/\log F_1(x_1), \ldots, -1/\log F_d(x_d)\}$.
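
The marginal standardization is a simple coordinatewise map. The sketch below applies $x \mapsto -1/\log F_j(x)$ (standard Fréchet scale) and, for comparison, $x \mapsto 1/\{1 - F_j(x)\}$ (standard Pareto scale) to a unit exponential margin, chosen only because its distribution and quantile functions are explicit; all function names are illustrative.

```python
import math


def exp_cdf(x):
    """Unit exponential distribution function (assumed example margin)."""
    return 1.0 - math.exp(-x)


def exp_quantile(p):
    """Quantile function of the unit exponential distribution."""
    return -math.log(1.0 - p)


def to_frechet(x, cdf):
    """Map an observation to the standard Frechet scale: -1 / log F(x)."""
    return -1.0 / math.log(cdf(x))


def to_pareto(x, cdf):
    """Map an observation to the standard Pareto scale: 1 / {1 - F(x)}."""
    return 1.0 / (1.0 - cdf(x))


def frechet_margin_cdf(z):
    """P[X_* <= z] = F{F^{<-}(exp(-1/z))}, which should equal exp(-1/z)."""
    return exp_cdf(exp_quantile(math.exp(-1.0 / z)))


x = exp_quantile(0.9)                 # the 90% quantile of the original margin
print(to_frechet(x, exp_cdf))         # -1 / log(0.9)
print(to_pareto(x, exp_cdf))          # 1 / (1 - 0.9) = 10
print(frechet_margin_cdf(2.0), math.exp(-0.5))
```

In practice the margins $F_j$ are unknown and would be replaced by estimates, which is outside the scope of this sketch.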

The margins of $F_*$ are all standard Fréchet, while its copula is the same as
the copula of $F$. Since the standard Fréchet distribution is in its own domain of
attraction, copula convergence as in (8.79) is equivalent to $F_* \in D(G_*)$, that is,

$$\lim_{t \to \infty} F_*^t(tz) = G_*(z), \tag{8.83}$$

where $G_*$ is obtained from $G$ after a transformation to standard Fréchet margins
as in (8.5). Alternative formulations of (8.83) are

$$\lim_{t \to \infty} t\{1 - F_*(tz)\} = -\log G_*(z), \qquad z \in [0, \infty], \tag{8.84}$$

as well as

$$1 - F_*(tz) \sim -\log G_*(tz) \sim 1 - G_*(tz), \qquad 0 < z < \infty;\ t \to \infty. \tag{8.85}$$

Taking (8.85) as an approximation for large $t$ leads again to the approximation
(8.81).

With $e = (1, \ldots, 1) \in \mathbb{R}^d$, equation (8.84) implies

$$\lim_{t \to \infty} \frac{1 - F_*(tz)}{1 - F_*(te)} = \frac{-\log G_*(z)}{-\log G_*(e)}, \qquad z \in [0, \infty], \tag{8.86}$$

that is, $1 - F_*$ is multivariate regularly varying on the cone $(0, \infty)$ (Resnick 1987);
see also section 8.4. Conversely, it is not hard to see that (8.86) also implies
$F_* \in D(G_*)$.

Finally, we could equally well have transformed to standard Pareto rather than
to standard Fréchet margins, that is, (8.79) is still equivalent to each of (8.83),
(8.84), (8.85), or (8.86) if, rather than (8.82), we had put

$$\begin{aligned} X_{*j} &= 1/\{1 - F_j(X_j)\}, \qquad j = 1, \ldots, d, \\ F_*(z) &= F\{F_1^{\leftarrow}(1 - 1/z_1), \ldots, F_d^{\leftarrow}(1 - 1/z_d)\}, \end{aligned} \tag{8.87}$$

for $1 < z_j < \infty$, $j = 1, \ldots, d$.


Tail dependence function convergence 

Closely related to the copula of $F$ is its tail dependence function

$$D_F(u) = 1 - F\{F_1^{\leftarrow}(1 - u_1), \ldots, F_d^{\leftarrow}(1 - u_d)\}. \tag{8.88}$$

Observe that $D_F(u) = 1 - C_F(1 - u_1, \ldots, 1 - u_d) = P[\bigcup_{j=1}^d \{F_j(X_j) > 1 -
u_j\}]$ and $1 - F(x) = D_F\{1 - F_1(x_1), \ldots, 1 - F_d(x_d)\}$. Using (8.84) and $D_F(u) =
1 - F_*(1/u_1, \ldots, 1/u_d)$ with $F_*$ as in (8.87) (Pareto margins), we find that (8.79)
is equivalent to

$$\lim_{s \downarrow 0} s^{-1} D_F(sv) = l(v), \qquad v \ge 0. \tag{8.89}$$

A few equivalent formulations of (8.89) are

$$\begin{aligned} l(v) &= \lim_{s \downarrow 0} s^{-1}\{1 - C_F(1 - sv_1, \ldots, 1 - sv_d)\} \\ &= \lim_{s \downarrow 0} s^{-1} P[\exists\, j = 1, \ldots, d : F_j(X_j) > 1 - sv_j] \\ &= \lim_{t \to \infty} t\, P\Big[\bigvee_{j=1}^d \frac{v_j}{1 - F_j(X_j)} > t\Big] \end{aligned} \tag{8.90}$$

for $v \ge 0$. Since the convergence in the previous equations is locally uniform in
$v \in [0, \infty)$, we may replace $1 - sv_j$ by any function of the form $1 - sv_j + o(s)$
as $s \downarrow 0$, for instance, $(1 - s)^{v_j}$ or $e^{-sv_j}$. In the bivariate case, a necessary and
sufficient condition is

$$\lim_{s \downarrow 0} s^{-1}[1 - C_F\{1 - s(1 - t), 1 - st\}] = A(t), \qquad t \in [0, 1], \tag{8.91}$$

where $A(t) = l(1 - t, t)$ is the Pickands dependence function of $G_*$. A useful
consequence is

$$\lim_{s \downarrow 0} s^{-1}\{1 - C_F(1 - s, 1 - s)\} = 2A(1/2) = \theta, \tag{8.92}$$

the extremal coefficient of (8.56).

Equation (8.89) and its reformulations point the way to non-parametric estimation
of $l$ from observations from $F$ (section 9.4.1). Moreover, since $s\,l(v) = l(sv)$
for $s > 0$, equation (8.89) has the interesting interpretation

$$1 - F(x) \approx l\{1 - F_1(x_1), \ldots, 1 - F_d(x_d)\} \tag{8.93}$$

provided all $1 - F_j(x_j)$ are sufficiently small. Approximation (8.93) is in fact a
first-order expansion of the one in (8.81). However, (8.81) is preferable over (8.93)
as the latter approximation undervalues the probability of joint exceedances in
different margins: for instance, if $d = 2$ and $l(v_1, v_2) = v_1 + v_2$ (independence),
then $P[X_1 > x_1, X_2 > x_2] \approx P[X_1 > x_1]P[X_2 > x_2]$ under (8.81) while $P[X_1 >
x_1, X_2 > x_2] \approx 0$ under (8.93). Also, in three or more dimensions, the right-hand
side of (8.93) does in general not define a valid distribution.
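
The difference between the approximations (8.81) and (8.93) for joint exceedances is easy to reproduce. In the sketch below, $p_j = 1 - F_j(x_j)$ denotes the marginal exceedance probabilities, the joint exceedance probability is computed by inclusion-exclusion as $p_1 + p_2 - \{1 - F(x)\}$, and $l(v_1, v_2) = v_1 + v_2$ is the independence case discussed above. Function names are ours.

```python
import math


def F_via_881(p1, p2, l):
    """F(x) ~ exp[-l{-log F_1(x_1), -log F_2(x_2)}], with F_j(x_j) = 1 - p_j."""
    return math.exp(-l(-math.log1p(-p1), -math.log1p(-p2)))


def F_via_893(p1, p2, l):
    """1 - F(x) ~ l{1 - F_1(x_1), 1 - F_2(x_2)}, rewritten as a value for F(x)."""
    return 1.0 - l(p1, p2)


def joint_exceedance(p1, p2, F_value):
    """P[X_1 > x_1, X_2 > x_2] = p1 + p2 - {1 - F(x)} by inclusion-exclusion."""
    return p1 + p2 - (1.0 - F_value)


def l_indep(v1, v2):
    """Stable tail dependence function under independence."""
    return v1 + v2


p1, p2 = 0.01, 0.02
joint_881 = joint_exceedance(p1, p2, F_via_881(p1, p2, l_indep))
joint_893 = joint_exceedance(p1, p2, F_via_893(p1, p2, l_indep))
print(joint_881, p1 * p2)   # (8.81) recovers the product of the marginal probabilities
print(joint_893)            # (8.93) returns zero up to rounding error
```

Under (8.81) the joint exceedance probability is the product $p_1 p_2$, whereas the first-order version (8.93) assigns it no mass at all.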


Exponent and spectral measure 

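Let $X_*$ and $F_*$ be as in (8.82) or (8.87). The condition $F_* \in D(G_*)$ can also be
linked to the exponent measure $\mu_*$ and the spectral measure $S$ of $G_*$, see (8.8)
and (8.16). First, by (8.76), $F_* \in D(G_*)$ is equivalent to

$$\mu_{*,t}(\cdot) = tP[t^{-1} X_* \in \cdot\,] \xrightarrow{v} \mu_*(\cdot), \qquad t \to \infty, \tag{8.94}$$

on $[0, \infty] \setminus \{0\}$. Taking (8.94) as an approximation for large $t$ leads to a recipe for
a non-parametric estimation of the exponent measure $\mu_*$ in section 9.4.1.

Second, let $T$ be the transformation to pseudo-polar coordinates as in (8.15)
determined by two norms $\|\cdot\|_1$ and $\|\cdot\|_2$ on $\mathbb{R}^d$. Applying $T$ to $t^{-1} X_*$ in (8.94)
and using (8.17), we find that $F_* \in D(G_*)$ is equivalent to

$$tP[(t^{-1}\|X_*\|_1,\ X_*/\|X_*\|_2) \in \cdot\,] \xrightarrow{v} r^{-2}\, dr\, S(d\omega), \qquad t \to \infty, \tag{8.95}$$

on $(0, \infty] \times \mathcal{S}$, with $\mathcal{S}$ the unit sphere on which $S$ is defined. Equation (8.95), and
hence $F_* \in D(G_*)$, is equivalent to

$$tP[\|X_*\|_1 > t,\ X_*/\|X_*\|_2 \in \cdot\,] \xrightarrow{v} S(\cdot), \qquad t \to \infty, \tag{8.96}$$

on $\mathcal{S}$, which, in turn, is equivalent to

$$\left.\begin{aligned} P[\|X_*\|_1 > t] &\sim t^{-1} S(\mathcal{S}) \\ P[X_*/\|X_*\|_2 \in \cdot \mid \|X_*\|_1 > t] &\xrightarrow{d} S(\cdot)/S(\mathcal{S}) \end{aligned}\right\} \qquad t \to \infty, \tag{8.97}$$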


(de Haan 1985). Equations (8.96) and (8.97) give an interpretation of the spectral
measure $S$ in terms of the distribution of the angular component of $X_*$ in the region
where its radial component is large. As for the exponent measure, interpreting limits
as approximations for large $t$ points the way to non-parametric estimators of $S$ in
section 9.4.1.


Point processes

In terms of point processes, we have, by (8.73), $F_* \in D(G_*)$ if and only if

$$\sum_{i=1}^{n} 1(n^{-1} X_{*i} \in \cdot) \xrightarrow{d} \text{Poisson process with mean measure } \mu_*, \tag{8.98}$$

where the $X_{*i}$ are independent copies of $X_*$ (de Haan 1985). This point-process
characterization can be used for likelihood-based statistical inference on the spectral
measure $S$ in the context of a parametric model, see section 9.4.2.
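
The interpretation of the spectral measure through (8.96) and (8.97) suggests a simple recipe: map standardized observations to pseudo-polar coordinates and keep the angular components of those whose radial component exceeds a high level $t$. The sketch below is a bare-bones illustration of that idea, assuming the sum-norm for both norms and a tiny hand-made data set; it is not the estimator of section 9.4.1.

```python
def l1_norm(z):
    """Sum-norm, used here for both the radial and the angular part (our choice)."""
    return sum(abs(c) for c in z)


def pseudo_polar(z, norm1=l1_norm, norm2=l1_norm):
    """Return (r, omega) with r = ||z||_1 and omega = z / ||z||_2."""
    r = norm1(z)
    scale = norm2(z)
    return r, tuple(c / scale for c in z)


def angles_above(points, t, norm1=l1_norm, norm2=l1_norm):
    """Angular components of the points whose radial component exceeds t."""
    kept = []
    for z in points:
        r, omega = pseudo_polar(z, norm1, norm2)
        if r > t:
            kept.append(omega)
    return kept


# Hypothetical observations, assumed to be already on a standardized (e.g. Frechet) scale.
sample = [(3.0, 1.0), (0.5, 0.5), (1.0, 9.0)]
print(pseudo_polar(sample[0]))        # (4.0, (0.75, 0.25))
print(angles_above(sample, t=2.0))    # [(0.75, 0.25), (0.1, 0.9)]
```

The empirical distribution of the retained angles then serves as a rough approximation to $S(\cdot)/S(\mathcal{S})$, with $t$ playing the role of the high radial threshold in (8.97).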


Asymptotic independence and complete dependence 


The two boundary cases within the class of dependence structures of multivariate
extreme value distributions are those of independence and complete dependence.
Although the latter is merely of academic importance, the former is highly relevant
in practice as many multivariate distributions, including the non-degenerate
multivariate normal, lie in the domain of attraction of a multivariate extreme value
distribution with independent margins, a result dating back to Sibuya (1960); see
also Example 9.3. Because of this, section 9.5 is devoted to more refined models
in case of asymptotic independence. Here, we restrict ourselves to some characterizations
of the domains of attraction of the two cases.

Asymptotic independence. A multivariate distribution function $F$ with copula $C_F$
is called asymptotically independent if $C_F$ satisfies (8.79) with the independent
copula as limit, that is, $C_G(u) = u_1 \cdots u_d$ for $u \in [0, 1]^d$. In terms of the tail
dependence function $D = D_F$ defined in (8.88), asymptotic independence can be
written as

$$\lim_{s \downarrow 0} s^{-1} D(sv) = v_1 + \cdots + v_d, \qquad v \in [0, \infty), \tag{8.99}$$

see (8.89). If additionally each marginal distribution $F_j$ of $F$ is in the domain of
attraction of an extreme value distribution $G_j$, then $F$ is in the domain of attraction
of the extreme value distribution $G$ given by $G(x) = G_1(x_1) \cdots G_d(x_d)$.

Berman (1961) already showed that a random vector $(X_1, \ldots, X_d)$ is asymptotically
independent as soon as all pairs $(X_i, X_j)$ with $i \ne j$ are asymptotically
independent. Let $D_{ij}$ be the bivariate tail dependence function of the pair $(X_i, X_j)$;
observe that $D_{ij}(v_i, v_j) = D(v)$ where the $k$th coordinate of $v$ is $v_k$ if $k \in \{i, j\}$
and zero otherwise. Elementary Bonferroni inequalities give

$$u_1 + \cdots + u_d \ge D(u) \ge u_1 + \cdots + u_d - \sum_{1 \le i < j \le d} \{u_i + u_j - D_{ij}(u_i, u_j)\}.$$

Hence $s^{-1} D_{ij}(sv_i, sv_j) \to v_i + v_j$ as $s \downarrow 0$ for all $i \ne j$ and all $(v_i, v_j) \in [0, \infty)^2$
indeed implies asymptotic independence.

Observe that the pair $(X_i, X_j)$ is asymptotically independent if

$$\lim_{s \downarrow 0} s^{-1} P[F_i(X_i) > 1 - sv_i,\ F_j(X_j) > 1 - sv_j] = 0, \qquad (v_i, v_j) \in [0, \infty)^2.$$

By monotonicity, it is sufficient to have the stated convergence for a single
$(v_i, v_j) \in (0, \infty)^2$; in particular, the pair $(X_i, X_j)$ is asymptotically independent if

$$\exists\, (v_i, v_j) \in (0, \infty)^2 : \lim_{s \downarrow 0} P[F_i(X_i) > 1 - sv_i \mid F_j(X_j) > 1 - sv_j] = 0.$$

Typically, this result is stated with $v_i = 1 = v_j$. In conjunction with the previous
paragraph, we obtain that the random vector $(X_1, \ldots, X_d)$ is asymptotically
independent if

$$\lim_{s \downarrow 0} P[F_i(X_i) > 1 - s \mid F_j(X_j) > 1 - s] = 0, \qquad 1 \le i < j \le d. \tag{8.100}$$

In terms of the copula of the pair $(X_i, X_j)$, that is, $C_{ij}(u_i, u_j) = P[F_i(X_i) \le
u_i, F_j(X_j) \le u_j]$, asymptotic independence can be written as

$$C_{ij}(1 - s, 1 - s) = 1 - 2s + o(s), \qquad s \downarrow 0;\ 1 \le i < j \le d.$$
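
Criterion (8.100) can be evaluated directly from a bivariate copula, since $P[F_i(X_i) > 1 - s \mid F_j(X_j) > 1 - s] = \{2s - 1 + C_{ij}(1 - s, 1 - s)\}/s$. The sketch contrasts two copulas chosen by us as examples: the Farlie-Gumbel-Morgenstern copula, which is asymptotically independent, and the Joe copula with parameter $\theta = 2$, which is not.

```python
def fgm_copula(u, v, a=0.5):
    """Farlie-Gumbel-Morgenstern copula, -1 <= a <= 1 (assumed example)."""
    return u * v * (1.0 + a * (1.0 - u) * (1.0 - v))


def joe_copula(u, v, theta=2.0):
    """Joe copula (assumed example with asymptotic dependence)."""
    a, b = (1.0 - u) ** theta, (1.0 - v) ** theta
    return 1.0 - (a + b - a * b) ** (1.0 / theta)


def conditional_exceedance(C, s):
    """P[F_i(X_i) > 1 - s | F_j(X_j) > 1 - s] = {2s - 1 + C(1 - s, 1 - s)} / s."""
    return (2.0 * s - 1.0 + C(1.0 - s, 1.0 - s)) / s


for s in (1e-1, 1e-2, 1e-4):
    print(s, conditional_exceedance(fgm_copula, s), conditional_exceedance(joe_copula, s))
```

For the first copula the conditional probability decreases to zero, as required by (8.100); for the second it tends to $2 - 2^{1/2}$, that is, to $2 - \theta$ with $\theta = 2A(1/2)$ the extremal coefficient of the limiting extreme value copula.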

Takahashi (1994) also showed that asymptotic independence arises as soon as

$$\exists\, v \in (0, \infty) : \lim_{s \downarrow 0} s^{-1} D(sv) = v_1 + \cdots + v_d. \tag{8.101}$$

Necessity of (8.101) follows from (8.99). But (8.101) is also sufficient: from the
inequalities

$$s^{-1} D(sv) \le s^{-1} D_{ij}(sv_i, sv_j) + \sum_{k \ne i, j} v_k, \qquad 1 \le i < j \le d,$$

it follows that (8.101) implies $s^{-1} D_{ij}(sv_i, sv_j) \to v_i + v_j$ as $s \downarrow 0$ for all $1 \le i <
j \le d$, whence indeed (pairwise) asymptotic independence.

Asymptotic complete dependence. In some sense opposite to the case of asymptotic
independence, a multivariate distribution function $F$ with copula $C_F$ is called
asymptotically completely dependent if $C_F$ satisfies (8.79) with the completely
dependent copula as limit, that is, $C_G(u) = u_1 \wedge \cdots \wedge u_d$ for $u \in [0, 1]^d$. In terms
of the tail dependence function $D = D_F$ defined in (8.88), asymptotic complete
dependence can be written as

$$\lim_{s \downarrow 0} s^{-1} D(sv) = v_1 \vee \cdots \vee v_d, \qquad v \in [0, \infty),$$

see (8.89). If additionally each marginal distribution $F_j$ of $F$ is in the domain of
attraction of an extreme value distribution $G_j$, then $F$ is in the domain of
attraction of the extreme value distribution $G$ given by $G(x) = G_1(x_1) \wedge \cdots \wedge
G_d(x_d)$.

Takahashi (1994) showed that asymptotic complete dependence arises as
soon as

$$\exists\, 0 < w < \infty : \lim_{s \downarrow 0} s^{-1} D(sw, \ldots, sw) = w. \tag{8.102}$$

To see that the above condition is indeed sufficient, take a vector $\mathbf{v} \in [0, \infty) \setminus \{0\}$ and set
$v = v_1 \vee \cdots \vee v_d > 0$. Then

$$v \le s^{-1} D(s\mathbf{v}) \le s^{-1} D(sv, \ldots, sv) = (v/w)(sv/w)^{-1} D\{(sv/w)w, \ldots, (sv/w)w\}.$$

Since the right-hand side converges to $v$, we obtain indeed $s^{-1} D(s\mathbf{v}) \to v$ as $s \downarrow 0$.

Also, pairwise asymptotic complete dependence implies asymptotic complete
dependence: the pairwise case entails $s^{-1} P[F_j(X_j) > 1 - s \ge F_i(X_i)] \to 0$ as
$s \downarrow 0$ for all $1 \le i < j \le d$ and thus

$$1 \le s^{-1} D(s, \ldots, s) \le s^{-1} P[F_1(X_1) > 1 - s] + \sum_{j=2}^{d} s^{-1} P[F_j(X_j) > 1 - s \ge F_1(X_1)] \to 1,$$

which by (8.102) forces asymptotic complete dependence.


8.4 Additional Topics 

We collect some topics that did not find their way into the main part of the text. 


Multivariate regular variation 


A rather popular condition implying that a distribution is in the domain of attraction
of a multivariate extreme value distribution is multivariate regular variation.
We have already encountered it in (8.86) as a necessary and sufficient condition
for the dependence structure of a distribution to be in the domain of attraction
of an extreme value dependence structure. More generally, let $F$ be a $d$-variate
distribution function with support $[0, \infty)$. Put $e = (1, \ldots, 1) \in \mathbb{R}^d$. We say that
$F$ is regularly varying on $(0, \infty)$ if there exists a function $\lambda : (0, \infty) \to (0, \infty)$
such that

$$\lim_{t \to \infty} \frac{1 - F(tx)}{1 - F(te)} = \lambda(x), \qquad x \in (0, \infty).$$

It follows that there exists a measure $\nu$ on $[0, \infty) \setminus \{0\}$ such that $\lambda(x) = \nu([0, \infty) \setminus
[0, x])$ for all $x > 0$. Observe that (8.86) says that $F_*$ is regularly varying on $(0, \infty)$
with limit measure $\nu(\cdot) = \mu_*(\cdot)/\mu_*([0, \infty) \setminus [0, e])$.

Most properties we discovered for $\mu_*$ extend also to $\nu$. For instance, there must
exist $0 < \alpha < \infty$ such that $\nu(t\,\cdot) = t^{-\alpha} \nu(\cdot)$ for all $0 < t < \infty$. For $t_n$ such that
$1 - F(t_n e) \sim n^{-1}$ as $n \to \infty$, we get $F^n(t_n x) \to \exp\{-\lambda(x)\}$, an extreme value
distribution with Fréchet margins. Also, $\nu$ admits a spectral decomposition of the
same kind as we found for $\mu_*$ in section 8.2.3. As in (8.97), the normalized spectral
measure can be interpreted as the limiting distribution of the angular component of a
random vector with distribution function $F$ given that its radial component is large.
A detailed account of multivariate regular variation can be found in Resnick (1987)
and Mikosch (2004); see also Bingham et al. (1987). Far-reaching generalizations
are developed in the monograph by Meerschaert and Scheffler (2001).

Now suppose that $F$ is an absolutely continuous $d$-variate distribution function
with density $f$ supported on $[0, \infty)$. Sufficient conditions in terms of $f$
for $F$ to be regularly varying on $(0, \infty)$ are stated in de Haan and Resnick
(1987). These are useful as most multivariate models are defined in terms of
their densities rather than their distribution functions. Typical examples where
the conditions can be applied are the (restriction to $[0, \infty)$ of the) multivariate
$t$-distribution and $F$-distribution. In combination with (8.86), the conditions can
serve as a tool to prove that the dependence structure of some absolutely continuous
distribution is in the domain of attraction of an extreme value dependence
structure.


Special classes of distributions 

For certain non-parametric classes of distributions, the domain-of-attraction conditions
in section 8.3 can be worked out explicitly. For instance, Hult and Lindskog
(2002) study the multivariate extremes of elliptical distributions, focusing in particular
on the limiting spectral measure.

Alternatively, Capéraà et al. (2000) study the class of bivariate copulas given by

$$C(u, v) = \phi^{-1}\left[\{\phi(u) + \phi(v)\}\, A\left\{\frac{\phi(v)}{\phi(u) + \phi(v)}\right\}\right], \qquad (u, v) \in [0, 1]^2.$$

Here $A$ is a Pickands dependence function, $\phi : (0, 1] \to [0, \infty)$ is convex and
decreasing and verifies $\phi(1) = 0$, the function $\phi^{-1}$ is the inverse function of $\phi$,
and we employed the conventions $\phi(0) = \lim_{u \downarrow 0} \phi(u)$ and $\phi^{-1}(s) = 0$ if $s \ge \phi(0)$.
The class unifies the families of bivariate extreme value copulas ($\phi = -\log$) and
Archimedean copulas by Genest and MacKay (1986) ($A \equiv 1$), whence the name
Archimax copulas. Within the class, it is easy to construct non-trivial examples of
copulas in the domain of attraction of any given bivariate extreme value copula.
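
The two special cases of the Archimax family can be verified numerically. The sketch below assumes the form $C(u, v) = \phi^{-1}[\{\phi(u) + \phi(v)\} A\{\phi(v)/(\phi(u) + \phi(v))\}]$, with the argument of $A$ chosen to match the convention $A(t) = l(1 - t, t)$; the logistic $A$ and the Clayton generator $\phi(t) = (t^{-\theta} - 1)/\theta$ are our own example choices.

```python
import math


def archimax(u, v, phi, phi_inv, A):
    """Archimax copula for u, v in (0, 1]; phi is the generator, A a Pickands function."""
    a, b = phi(u), phi(v)
    total = a + b
    if total == 0.0:            # only possible when u = v = 1
        return 1.0
    return phi_inv(total * A(b / total))


def logistic_pickands(t, alpha=0.5):
    """Assumed logistic Pickands dependence function."""
    return ((1.0 - t) ** (1.0 / alpha) + t ** (1.0 / alpha)) ** alpha


def clayton_phi(t, theta=2.0):
    """Clayton generator (assumed example of an Archimedean generator)."""
    return (t ** (-theta) - 1.0) / theta


def clayton_phi_inv(s, theta=2.0):
    return (1.0 + theta * s) ** (-1.0 / theta)


u, v = 0.4, 0.7

# phi = -log recovers the extreme value copula exp[-l(-log u, -log v)].
ev = archimax(u, v, lambda t: -math.log(t), lambda s: math.exp(-s), logistic_pickands)
ev_direct = math.exp(-math.sqrt(math.log(u) ** 2 + math.log(v) ** 2))

# A = 1 recovers the Archimedean (here Clayton) copula.
arch = archimax(u, v, clayton_phi, clayton_phi_inv, lambda t: 1.0)
arch_direct = (u ** -2.0 + v ** -2.0 - 1.0) ** -0.5

print(ev, ev_direct)
print(arch, arch_direct)
```

Combining a non-trivial generator with a non-trivial $A$, for instance the Clayton generator with the logistic $A$, gives a copula that belongs to neither sub-family.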


Other extreme-related quantities 

Rather than the coordinate-wise maximum or the exceedances over a high multi¬ 
variate threshold, other quantities related to the extremes of a multivariate sequence 
have been studied in the literature as well. 
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Cheng et al. (1995), for instance, study multivariate intermediate order statis¬ 
tics, defined as follows. Let X (, i = 1,..., w be independent, identically distributed 
d-dimensional random vectors. For j = 1, ..., d, let X(i)j < ••- < X( n )j be the 
ascending order statistics corresponding to the observations X\j ,..., X n j. For 
every j = l, d, let (k n j) n be an intermediate sequence of positive integers, 
that is, k n j oo and k n j/n —> 0 as n ^ oo. Suppose also that all k n j grow at 
the same rate. Cheng et al. (1995) then find the asymptotic distribution as w ^ oo 
of the sequence of vectors Z ㈨)， „ = (X( kflA ),n ,.. • ， X (kn d) , n ). 

Records can also be studied in the multivariate case, although a natural definition
of multivariate records is not obvious because of the lack of a natural ordering
for multivariate observations. The principle of marginal ordering suggests the following
definition: $X_n$ is a record in the sequence $X_1, \ldots, X_n$ if $X_n > \bigvee_{i=1}^{n-1} X_i$,
that is, if there is a record simultaneously in all coordinates. The asymptotic distribution
of the sequence of such records is the topic of Goldie and Resnick (1995)
and the references therein. Alternatively, in the context of Gaussian processes,
Habach (1997) defines $X_n$ to be a record as soon as there is a record in one of the
coordinates, that is, if $X_{nj} > \bigvee_{i=1}^{n-1} X_{ij}$ for some $j = 1, \ldots, d$.
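
The two definitions of a multivariate record can be contrasted on a small example. The Python sketch below is ours and purely illustrative: the function name, the 0-based indexing, and the convention that the first observation is not counted as a record are assumptions, and the sample is made up.

```python
def record_times(xs, mode="all"):
    """Return the 0-based indices n >= 1 at which xs[n] is a record.

    mode="all": record simultaneously in all coordinates (marginal ordering).
    mode="any": record in at least one coordinate.
    """
    d = len(xs[0])
    running_max = list(xs[0])            # coordinate-wise maximum of earlier observations
    combine = all if mode == "all" else any
    times = []
    for n in range(1, len(xs)):
        if combine(xs[n][j] > running_max[j] for j in range(d)):
            times.append(n)
        running_max = [max(running_max[j], xs[n][j]) for j in range(d)]
    return times

sample = [(1, 5), (2, 4), (3, 6), (2, 7), (4, 8)]
print(record_times(sample, "all"))
print(record_times(sample, "any"))
```

Every record in the first sense is also a record in the second sense, so the first list is always a subset of the second.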

A concept that is inherently multivariate is that of concomitants or induced
order statistics. For instance, let $(X_{i1}, X_{i2})$, $i = 1, \ldots, n$, be a sample of bivariate
random pairs and let $X_{(1),1} \le \cdots \le X_{(n),1}$ be the ascending order statistics in the
first coordinate. Then the value of the second coordinate of the pair of which the
first coordinate is equal to $X_{(i),1}$ is called the concomitant of that order statistic and
is denoted by $X_{[i],2}$. For example, $X_{[n],2}$ is the second coordinate of the pair with the
largest first coordinate. The distribution of concomitants of extreme order statistics
is investigated in David (1994) and Nagaraja and David (1994). Ledford and Tawn
(1998) focus on the concomitant of the largest order statistic in case the marginal
bivariate survivor function is bivariate regularly varying, see section 9.5. In particular,
they give an asymptotic expansion for the tail function of that concomitant
and find the asymptotic probability that the pair of coordinate-wise maxima is an
actual observation, that is, $P[X_{[n],2} = X_{(n),2}]$.
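
Concomitants are easy to compute in practice: sort the pairs on the first coordinate and read off the second. The short Python sketch below is ours; the function names are hypothetical, the data are made up, and ties in the first coordinate are assumed absent.

```python
def concomitants(pairs):
    """Sort pairs by first coordinate; element i is (X_(i+1),1 , X_[i+1],2)."""
    return sorted(pairs, key=lambda p: p[0])

def maxima_form_an_observation(pairs):
    """True if the pair of coordinate-wise maxima is an actual observation,
    that is, if the concomitant of the largest first coordinate is the largest second coordinate."""
    top_concomitant = concomitants(pairs)[-1][1]
    return top_concomitant == max(p[1] for p in pairs)

sample_a = [(1.0, 9.0), (3.0, 2.0), (2.0, 5.0)]
sample_b = [(1.0, 1.0), (2.0, 3.0), (5.0, 4.0)]
print([c for _, c in concomitants(sample_a)])
print(maxima_form_an_observation(sample_a), maxima_form_an_observation(sample_b))
```

Averaging `maxima_form_an_observation` over many simulated samples would give a Monte Carlo estimate of the probability studied by Ledford and Tawn (1998).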


Rates of convergence 

Recall from Chapters 4 and 5 that in one dimension, because of slow convergence 
in the domain-of-attraction condition, estimators of the tail of a distribution sometimes
suffer from a substantial bias. A similar problem may arise in higher dimensions,
an extra issue being the rate of convergence of the dependence structure.

Omey and Rachev (1991) and Falk and Reiss (2002) investigate the rate of 
convergence of the copula of the sample maximum to the limiting extreme value 
copula (8.79) with respect to the uniform metric, whereas de Haan and Peng (1997) 
employ the stronger total variation metric. Alternatively, Kaufmann and Reiss 
(1995) consider the rate of convergence of certain point processes of exceedances 
to the limiting Poisson process, corollaries being rates of convergence for the 
joint distributions of upper order statistics, although their error term appears to
be sub-optimal. Finally, Nadarajah (2000) gives asymptotic expansions for the
convergence of spectral densities in (8.96).

More general settings than i.i.d. sequences 


Up to now, we always started from a sequence of independent, identically distributed
random vectors. This setting can be generalized in a number of ways.

A first possibility is to drop the assumption of stationarity. For instance, Hüsler
(1989b), building on work by Gerritse (1986), characterizes the class of limit
distributions of normalized maxima of sequences of independent, non-identically
distributed random vectors and states a number of properties of the dependence
structure of the possible limit laws. Moreover, Hüsler (1989a) gives conditions
under which the extremes of a general non-stationary, possibly dependent sequence
of random vectors have the same asymptotic distribution as the corresponding
sequence with independent random vectors.

Alternatively, one can drop the assumption of independence. Hsing (1989)
and Hüsler (1990) examine the asymptotic distribution of normalized maxima of
general stationary sequences of random vectors; see also section 10.5.
The asymptotic distribution of point processes of exceedances and vectors of
extreme order statistics for multivariate stationary normal sequences is the topic
of Wisniewski (1996).

Finally, interesting results can also be obtained for a triangular array
$\{X_{in} : n = 1, 2, \ldots;\ i = 1, \ldots, n\}$ of independent $d$-dimensional random vectors. Hüsler and
Reiss (1989) consider the case where every row $X_{1n}, \ldots, X_{nn}$ consists of centred,
unit-variance normal random vectors with correlation matrix $\rho_n$ depending on $n$.
For instance, in the bivariate case, they find that if $(1 - \rho_n) \log(n) \to \lambda^2 \in [0, \infty]$
as $n \to \infty$ then the suitably normalized maximum $M_n = \bigvee_{i=1}^{n} X_{in}$ converges
weakly to a member of a parametric family of multivariate extreme value distributions with
dependence structure depending on $\lambda$, see section 9.2. More general triangular
arrays are considered in Hüsler (1994).

8.5 Summary 

For the reader’s convenience, we provide a summary of the essential facts to be 
remembered from the theory of multivariate extremes. 

We work in $d$-dimensional space. The distribution functions $G$ with non-degenerate
margins that can arise as the limit in $\lim_{n \to \infty} F^n(a_n x + b_n) = G(x)$,
where $F$ is a $d$-variate distribution function and $a_n$ and $b_n$ are arbitrary vectors, the
entries of $a_n$ being positive, are called multivariate extreme value distribution functions.
We say that $F$ is in the (max-)domain of attraction of $G$. The interpretation
is that $G$ is the limit distribution of the properly normalized component-wise maximum
of an independent sample from $F$ as the sample size tends to infinity. The
class of extreme value distributions coincides with that of max-stable distributions.
The margins, $G_j$, of a max-stable distribution function $G$ are univariate extreme
value distribution functions themselves. They have the corresponding margins, $F_j$,
of $F$ in their respective domains of attraction. In order to study the dependence
structure of $G$, we may, without loss of generality, standardize the margins of $G$
to the standard Fréchet distribution by
$G_*(z) = G\{G_1^{\leftarrow}(e^{-1/z_1}), \ldots, G_d^{\leftarrow}(e^{-1/z_d})\}$
for $z \in [0, \infty]$.

The function $V_* = -\log G_*$ satisfies the homogeneity relation $V_*(sz) = s^{-1} V_*(z)$
for $0 < s < \infty$. Moreover, there exists a measure, $\mu_*$, on $[0, \infty) \setminus \{0\}$, the exponent
measure, such that $V_*(z) = \mu_*(\{x \ge 0 : x \not\le z\})$. The exponent measure $\mu_*$
inherits a similar homogeneity property from $V_*$.

In polar coordinates, the measure $\mu_*$ factorizes as a product measure in the
radial and angular components. More specifically, identifying $z$ with $(r, \omega)$, where
$r = z_1 + \cdots + z_d$ is the "radius" and $\omega = (z_1/r, \ldots, z_d/r)$ is the "angle", we have
$\mu_*(dz) = r^{-2}\, dr\, H(d\omega)$. Here, the spectral measure $H$ is a finite measure on the unit
simplex, $S_d = \{\omega \ge 0 : \omega_1 + \cdots + \omega_d = 1\}$. The only requirement on a positive
measure $H$ on $S_d$ to be the spectral measure of an extreme value distribution is
that $\int_{S_d} \omega_j H(d\omega) = 1$ for all $j = 1, \ldots, d$. Alternative definitions of the spectral
measure are possible, starting from a different choice of the radial and angular
components.

The stable tail dependence function $l$ is given by $l(v) = V_*(1/v_1, \ldots, 1/v_d)$
for $0 \le v \le \infty$. It satisfies the homogeneity relation $l(sv) = s\,l(v)$ for $0 < s < \infty$
and is connected to the extreme value distribution $G$ and the spectral measure $H$
through

$$G(x) = \exp[-l\{-\log G_1(x_1), \ldots, -\log G_d(x_d)\}],$$

$$l(v) = \int_{S_d} \bigvee_{j=1}^{d} (\omega_j v_j)\, H(d\omega).$$

The partial derivatives of $l$ or $V_*$ can be used to compute the densities of the
spectral measure $H$ on the $2^d - 1$ faces of the unit simplex $S_d$.

The two extreme cases for the dependence structure of an extreme value distribution
are those of independence and complete dependence. In general, the
dependence structure lies between these cases. In particular, extreme value distributions
always exhibit positive association. In case of independence, the spectral
measure $H$ consists of unit point masses at each of the $d$ vertices of the unit simplex
$S_d$ and the stable tail dependence function is given by $l(v) = v_1 + \cdots + v_d$.
Independence arises as soon as all pairs are independent. In case of complete
dependence, $H$ reduces to a single point mass of size $d$ at the centre-point of $S_d$,
and $l(v) = v_1 \vee \cdots \vee v_d$.
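
The two boundary cases can be checked directly from the spectral representation of $l$ when $H$ is discrete, because the integral then reduces to a finite sum over the atoms. The following Python sketch is ours; the function names and the test vector are hypothetical.

```python
def stdf_from_atoms(atoms):
    """Build l(v) = sum_k w_k * max_j(omega_kj * v_j) for a discrete spectral measure H.

    `atoms` is a list of (mass, point) pairs, each point lying on the unit simplex.
    """
    def l(v):
        return sum(w * max(o * x for o, x in zip(omega, v)) for w, omega in atoms)
    return l

def satisfies_moment_constraints(atoms, tol=1e-12):
    """Check that the integral of omega_j against H equals 1 for every coordinate j."""
    d = len(atoms[0][1])
    return all(abs(sum(w * omega[j] for w, omega in atoms) - 1.0) < tol for j in range(d))

d = 3
# Independence: unit point masses at the d vertices of the simplex.
independent = [(1.0, tuple(1.0 if k == j else 0.0 for k in range(d))) for j in range(d)]
# Complete dependence: a single point mass of size d at the centre of the simplex.
dependent = [(float(d), tuple(1.0 / d for _ in range(d)))]

v = (0.5, 2.0, 1.25)
print(stdf_from_atoms(independent)(v))   # equals v_1 + v_2 + v_3
print(stdf_from_atoms(dependent)(v))     # equals max(v_1, v_2, v_3)
```

Any other list of atoms passing `satisfies_moment_constraints` yields a valid stable tail dependence function lying between these two cases.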

In two dimensions, all information on the dependence structure is contained in
Pickands dependence function $A(t) = l(1 - t, t)$, where $t \in [0, 1]$. It is convex and
satisfies $t \vee (1 - t) \le A(t) \le 1$. These are the only restrictions for a function $A$
to be the Pickands dependence function of a bivariate extreme value distribution.
The lower and upper boundaries on $A$ correspond to complete dependence and
independence, respectively. Identifying the unit simplex $S_2$ with the unit interval
$[0, 1]$, we can easily obtain from $A$ the point masses of $H$ on 0 and 1 and its
density $h$ on $(0, 1)$.

For a given extreme value distribution function $G$, we formulate various equivalent
conditions for a distribution function $F$ to lie in its domain of attraction. A
number of these conditions are in terms of the limit distribution of excesses over
a high multivariate threshold. Equally useful is to split the domain-of-attraction
condition into two pieces: first, the margins of $F$ must lie in the (univariate)
domains of attraction of the corresponding margins of $G$, and second, informally
stated, the dependence structure of $F$ must lie in the domain of attraction of the
dependence structure of $G$.

A particularly useful interpretation of the domain-of-attraction condition is the
approximation

$$F(x) \approx \exp[-l\{-\log F_1(x_1), \ldots, -\log F_d(x_d)\}]$$

for $x$ such that $1 - F_j(x_j)$ is small for all $j = 1, \ldots, d$, and with $l$ the stable tail
dependence function of the limiting extreme value distribution. Combined with
a Generalized Pareto model for the marginal tails, this leads to (semi-)parametric
models for $F$ in regions of the form $[u, \infty)$ for high multivariate thresholds $u$. A
related, slightly less accurate approximation is
$1 - F(x) \approx l\{1 - F_1(x_1), \ldots, 1 - F_d(x_d)\}$. Alternatively, we find a condition in terms of convergence of certain point
processes to a non-homogeneous Poisson process with intensity measure $\mu_*$.


8.6 Appendix 

8.6.1 Computing spectral densities 

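We give a proof of (8.34) expressing the densities of the spectral measure $H$ on
the faces of the unit simplex $S_d$ in terms of the derivatives of $V_* = -\log G_*$. Let
$z > 0$. By the inclusion-exclusion formula,

$$\begin{aligned}
V_*(z) &= \mu_*(\{x \ge 0 : x_j > z_j \text{ for some } j = 1, \ldots, d\}) \\
&= \sum_{\emptyset \ne b \subset \{1, \ldots, d\}} (-1)^{|b|-1} \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in b\}).
\end{aligned}$$

Now let $a = \{j_1, \ldots, j_m\}$ be a non-empty subset of $\{1, \ldots, d\}$, and let $D_a$ be the
differential operator $\partial^m / (\partial z_{j_1} \cdots \partial z_{j_m})$. Applying $D_a$ to both sides of the previous
equation, we only retain those terms for which $b$ contains $a$, that is,

$$D_a V_*(z) = \sum_{a \subset b \subset \{1, \ldots, d\}} (-1)^{|b|-1} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in b\}).$$

Denote $a^c = \{1, \ldots, d\} \setminus a$. We can split the sum above in two parts: the term
corresponding to $b = a$, and the terms corresponding to $b = a \cup b'$ with $\emptyset \ne b' \subset a^c$.
We get

$$\begin{aligned}
D_a V_*(z) &= (-1)^{|a|-1} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a\}) \\
&\quad + (-1)^{|a|} D_a \sum_{\emptyset \ne b' \subset a^c} (-1)^{|b'|-1} \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a \cup b'\}).
\end{aligned}$$

Applying the inclusion-exclusion formula again, we get

$$\begin{aligned}
D_a V_*(z) &= (-1)^{|a|-1} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a\}) \\
&\quad + (-1)^{|a|} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a \text{ and some } j \in a^c\}) \\
&= (-1)^{|a|-1} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a;\ x_j \le z_j \text{ for all } j \in a^c\}).
\end{aligned}$$

Now if we let $z_j \to 0$ for all $j \in a^c$, we get

$$\lim_{z_j \to 0,\ j \in a^c} D_a V_*(z) = (-1)^{|a|-1} D_a \mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a;\ x_j = 0 \text{ for all } j \in a^c\}).$$

Let $h_a$ be the density of $H$ on $S_{d,a}$. Using the spectral decomposition (8.17) and
the multivariate change-of-variables formula, we get

$$\begin{aligned}
&\mu_*(\{x \ge 0 : x_j > z_j \text{ for all } j \in a;\ x_j = 0 \text{ for all } j \in a^c\}) \\
&\quad = \int_{z_{j_1}}^{\infty} \cdots \int_{z_{j_m}}^{\infty} h_a\left(\frac{x_{j_1}}{\sum_{j \in a} x_j}, \ldots, \frac{x_{j_{m-1}}}{\sum_{j \in a} x_j}\right) \Big(\sum_{j \in a} x_j\Big)^{-(m+1)} dx_{j_1} \cdots dx_{j_m}.
\end{aligned}$$

Apply the operator $D_a$ on both sides of this equation to get (8.34).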


8.6.2 Representations of extreme value distributions 

Let $G$ be a $d$-variate extreme value distribution function with margins $G_j$ for
$j = 1, \ldots, d$. We have seen equivalent descriptions of the dependence structure
of $G$ in terms of, amongst others, the simple max-stable distribution function
$G_* = \exp(-V_*)$, the exponent measure $\mu_*$, the stable tail dependence function $l$,
the spectral measure $S$ w.r.t. two norms $\|\cdot\|_i$ ($i = 1, 2$) on $\mathbb{R}^d$, the copula $C_G$,
and, in the bivariate case, Pickands dependence function $A$. For easy reference, we
collect here the connections between these various descriptions.

Formulas for $G$

$$\begin{aligned}
G(x) &= G_*\{-1/\log G_1(x_1), \ldots, -1/\log G_d(x_d)\} \\
&= \exp\{-\mu_*([0, \infty] \setminus [0, (-1/\log G_1(x_1), \ldots, -1/\log G_d(x_d))])\} \\
&= \exp[-l\{-\log G_1(x_1), \ldots, -\log G_d(x_d)\}] \\
&= \exp\left[-\int \bigvee_{j=1}^{d} \frac{\omega_j}{\|\omega\|_1} \{-\log G_j(x_j)\}\, S(d\omega)\right] \\
&= C_G\{G_1(x_1), \ldots, G_d(x_d)\}
\end{aligned}$$

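Formulas for $G_*$

$$\begin{aligned}
G_*(z) &= G\{G_1^{\leftarrow}(e^{-1/z_1}), \ldots, G_d^{\leftarrow}(e^{-1/z_d})\} \\
&= \exp\{-\mu_*([0, \infty] \setminus [0, z])\} \\
&= \exp\{-l(1/z_1, \ldots, 1/z_d)\} \\
&= \exp\left\{-\int \bigvee_{j=1}^{d} \frac{\omega_j}{\|\omega\|_1} \frac{1}{z_j}\, S(d\omega)\right\} \\
&= C_G(e^{-1/z_1}, \ldots, e^{-1/z_d})
\end{aligned}$$
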
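Formulas for $l$

$$\begin{aligned}
l(v) &= -\log G\{G_1^{\leftarrow}(e^{-v_1}), \ldots, G_d^{\leftarrow}(e^{-v_d})\} \\
&= -\log G_*(1/v_1, \ldots, 1/v_d) \\
&= \mu_*([0, \infty] \setminus [0, (1/v_1, \ldots, 1/v_d)]) \\
&= \int \bigvee_{j=1}^{d} \frac{\omega_j}{\|\omega\|_1} v_j\, S(d\omega) \\
&= -\log C_G(e^{-v_1}, \ldots, e^{-v_d})
\end{aligned}$$
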
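Formulas for $S$

$$\begin{aligned}
S(B) &= \mu_*(\{z \in [0, \infty) : \|z\|_1 > 1,\ z/\|z\|_2 \in B\}) \\
&= \int_{S_d} 1\left(\frac{\omega}{\|\omega\|_2} \in B\right) \|\omega\|_1\, H(d\omega)
\end{aligned}$$

Formulas for $C_G$

$$\begin{aligned}
C_G(u) &= G\{G_1^{\leftarrow}(u_1), \ldots, G_d^{\leftarrow}(u_d)\} \\
&= G_*(-1/\log u_1, \ldots, -1/\log u_d) \\
&= \exp\{-\mu_*([0, \infty] \setminus [0, (-1/\log u_1, \ldots, -1/\log u_d)])\} \\
&= \exp\{-l(-\log u_1, \ldots, -\log u_d)\}
\end{aligned}$$
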
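Bivariate case: Formulas in terms of $A$

$$G(x_1, x_2) = \exp\left[\log\{G_1(x_1) G_2(x_2)\}\, A\left(\frac{\log\{G_2(x_2)\}}{\log\{G_1(x_1) G_2(x_2)\}}\right)\right]$$

$$G_*(z_1, z_2) = \exp\left\{-\left(\frac{1}{z_1} + \frac{1}{z_2}\right) A\left(\frac{z_1}{z_1 + z_2}\right)\right\}$$

$$l(v_1, v_2) = (v_1 + v_2)\, A\left(\frac{v_2}{v_1 + v_2}\right)$$

$$\begin{aligned}
H([0, \omega]) &= \mu_*(\{(z_1, z_2) \in [0, \infty)^2 : z_1 + z_2 > 1,\ z_1/(z_1 + z_2) \le \omega\}) \\
&= \begin{cases} 1 + A'(\omega) & \text{if } \omega \in [0, 1) \\ 2 & \text{if } \omega = 1 \end{cases}
\end{aligned}$$

$$S(B) = \int_{[0,1]} 1\left\{\frac{(\omega, 1 - \omega)}{\|(\omega, 1 - \omega)\|_2} \in B\right\} \|(\omega, 1 - \omega)\|_1\, dH([0, \omega])$$

$$C_G(u_1, u_2) = \exp\left\{\log(u_1 u_2)\, A\left(\frac{\log(u_2)}{\log(u_1 u_2)}\right)\right\}$$

Formulas for $A$

$$\begin{aligned}
A(t) &= -\log G[G_1^{\leftarrow}\{e^{-(1-t)}\}, G_2^{\leftarrow}(e^{-t})] \\
&= -\log G_*\{(1 - t)^{-1}, t^{-1}\} \\
&= \mu_*([0, \infty]^2 \setminus [0, ((1 - t)^{-1}, t^{-1})]) \\
&= l(1 - t, t) \\
&= \int \left[\frac{(1 - t)\,\omega_1}{\|(\omega_1, \omega_2)\|_1} \vee \frac{t\,\omega_2}{\|(\omega_1, \omega_2)\|_1}\right] S(d(\omega_1, \omega_2)) \\
&= -\log C_G\{e^{-(1-t)}, e^{-t}\}
\end{aligned}$$


STATISTICS OF MULTIVARIATE EXTREMES

co-authored by Björn Vandewalle

9.1 Introduction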


Given a sample of multivariate observations, assumed to be generated by independent 
and identically distributed random vectors, how to estimate the tail of the underlying 
multivariate distribution? In particular, how to estimate with good relative accuracy 
the probability of an event in a region of the sample space with none or only very few 
of the observations? As for statistics of univariate extremes, this calls for generally 
applicable models based on which it is justified to extrapolate outside of the sample 
region. If the interest is in the occurrence of joint extremes in several coordinates, 
then proper modelling of the marginal distributions should be complemented by a 
correct assessment of the dependence structure at extreme levels. 

A successful class of models and inference techniques is based on the multivariate
extreme value distributions, studied extensively in Chapter 8. The argument in
favour of these distributions is summarized by the property that, as in the univariate
case, the tail of a distribution in the domain of attraction of an extreme value
distribution can be approximated by the tail of that extreme value distribution itself.

As the class of multivariate extreme value distributions does not admit a finite-dimensional
parametrization, a quite popular approach is to perform inference
within a well-chosen parametric sub-model. A number of such models have been
shown in various case studies to be particularly successful in combining analytical
tractability with practical applicability. Of course, new situations may ask for new
models, so it is useful to have tools to construct parametric families of multivariate
extreme value distributions. All this is treated in section 9.2.

As for the univariate case, historically the first statistical methods for multivariate 
extremes follow the annual maximum approach (Gumbel and Goldstein 1964). The 
approach consists of partitioning a sample of multivariate observations into blocks, 
each typically corresponding to one year of observations, and fitting a multivariate 
extreme value distribution to the sample of component-wise block maxima. The 
crucial point here is to estimate the multivariate extreme value dependence structure 
or copula. Section 9.3 describes both parametric and non-parametric techniques to 
do this. 

Reducing the sample to a single observation per year disregards the possibility 
of a given year to witness several relevant events. More efficient is to use all data 
that are in some sense large, for instance all observations for which at least one 
coordinate exceeds a high threshold, which may differ according to the coordinate. 
The modelling assumption, motivated by the domain-of-attraction conditions of 
section 8.3, is that the dependence structure of the underlying distribution may at 
extreme levels be approximated by a max-stable dependence structure. Again, the 
choice is between parametric inference within a subclass or general non-parametric 
techniques, most of which are motivated by the spectral decomposition of a multivariate
extreme value distribution.

Both the annual maximum approach and the threshold approach are founded in 
the paradigm of multivariate extreme value distributions, motivated by the theory 
in Chapter 8. The resulting models are therefore restricted to either perfect independence
or asymptotic dependence. Neither of these may be satisfactory for cases of
asymptotic independence with positive or negative association at penultimate thresholds,
such as, for instance, the bivariate normal with positive or negative correlation.
This calls for more refined models for the joint survivor function of a random vector 
in case of asymptotic independence, and these are presented in section 9.5. 

We conclude the chapter with a number of additional topics in section 9.6 and 
a summary in section 9.7. 


Loss-ALAE data 


We will illustrate the methods in this chapter on the Loss-ALAE data set comprising
1500 liability claims in an insurance set-up, see Figure 1.15 in section 1.3.3.
Each claim consists of a loss or indemnity payment and an Allocated Loss Adjustment
Expense. ALAEs can be seen as additional costs for the insurance company,
such as lawyers’ fees and investigation expenses resulting from individual claim
settlements. The scatterplot of the two variables in Figure 9.1(a) suggests a strong
relationship between losses and other expenses at intermediate levels, as confirmed
by the value of the correlation coefficient, 0.4.

Figure 9.1 Scatterplot of ALAE versus Loss: (a) original data (log-scale), (b) data
transformed to uniform (0, 1) margins.

Starting from (Loss, ALAE) observations $(x_{i1}, x_{i2})$, $i = 1, \ldots, n$, we can obtain
an informal, margin-free picture of dependence by transforming the data to have uniform
(0, 1) marginal distributions using the (modified) empirical marginal distribution
functions. For $i = 1, \ldots, n$ and $j = 1, 2$, define

$$u_{ij} = \frac{1}{n + 1} \sum_{k=1}^{n} 1(x_{kj} \le x_{ij}). \qquad (9.1)$$

If we consider the $(x_{i1}, x_{i2})$ as being realizations of independent random pairs
with common distribution function $F$, then the $(u_{i1}, u_{i2})$ can be interpreted as realizations
from the copula, $C$, of $F$. The scatterplot of the $(u_{i1}, u_{i2})$ in Figure 9.1(b)
suggests some dependence between losses and ALAEs at levels where both are high.
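
The transformation to uniform margins amounts to replacing each observation by its rescaled rank within its own coordinate. The Python sketch below is a minimal illustration of ours: it assumes that the modification of the empirical distribution function consists in dividing by n + 1 rather than n, and the function name and the four claims are made up (they are not the Loss-ALAE data).

```python
def to_uniform_margins(data):
    """Map each observation to (u_i1, ..., u_id), where
    u_ij = #{k : x_kj <= x_ij} / (n + 1) is the modified empirical d.f. at x_ij."""
    n = len(data)
    d = len(data[0])
    return [
        tuple(sum(1 for k in range(n) if data[k][j] <= data[i][j]) / (n + 1)
              for j in range(d))
        for i in range(n)
    ]

# Hypothetical (loss, ALAE) pairs, for illustration only.
claims = [(10.0, 3.0), (250.0, 1.0), (40.0, 8.0), (40.0, 2.0)]
for pair in to_uniform_margins(claims):
    print(pair)
```

Dividing by n + 1 keeps every transformed value strictly inside (0, 1), which matters when the values are subsequently passed through quantile functions that are infinite at 0 or 1. The double loop is quadratic in n; for large samples a sort-based ranking would be preferable.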

The case studies in this chapter were partially performed with the R package
‘evd’ by Alec Stephenson, freely available from cran.r-project.org, including
some routines by Chris Ferro, and with routines written by Björn Vandewalle.

9.2 Parametric Models 

Recall from section 8.2 that the family of $d$-variate extreme value distributions
is indexed by a positive measure on the unit simplex $S_d$ satisfying a number of
moment restrictions. In particular, unlike the univariate case, the family does not 
admit a finite-dimensional parametrization. As a consequence, we lose the comfort 
of parametric likelihood machinery, which guarantees efficient estimation, easy 
assessment of estimation uncertainty, hypothesis testing and inclusion of covariate 
information. This is a major setback. 

To retain these features, one can postulate a parametric subfamily rather than
work in the general class. Of course, there is a price to pay: sacrificing generality
comes at the risk of model mis-specification. A good balance must then be struck
between model flexibility and analytical tractability.

This raises the issue of model construction and model choice. No model can 
be expected to work well in all situations. New data may require new models. 
However, because of the constraints that a dependence structure must fulfil in 
order to be an extreme value dependence structure, it is not straightforward to 
generate valid parametric families, let alone useful ones. In section 9.2.1, we list 
a number of tools for generating multivariate extreme value models. An overview 
of the most popular models is given in section 9.2.2. 

9.2.1 Model construction methods 

Max-stable processes 

Loosely speaking, max-stable processes are stochastic processes of which all finite-dimensional
distributions are multivariate extreme value distributions. They can be
viewed as infinite-dimensional generalizations of extreme value distributions. A
spectral representation for such processes by de Haan (1984) was turned into a
versatile tool by Smith (1991) for a construction method for multivariate extreme
value distributions that allows certain characteristics of a physical process under
study to be incorporated into the model.

Consider the following situation. A certain system $V$ is affected by a collection
of shock events of possibly different sizes and different types. Event number $i$ has
size $0 < r_i < \infty$ and is of type $s_i$ in some classification space $S$. The impact
caused to an element $v$ of the system $V$ by an event of size $r$ and type $s$ is equal
to $r f(s, v)$, where $f : S \times V \to (0, \infty)$ is the so-called event profile function. The
aggregate impact $Z_v$ to an element $v$ caused by all events $i$ is equal to the maximal
impact by each of the events, that is, $Z_v = \max_i \{r_i f(s_i, v)\}$.

Now we make the following assumptions: (i) the events $(r_i, s_i)$ form a Poisson
process on the space $(0, \infty) \times S$ with intensity measure $r^{-2}\, dr\, \nu(ds)$ for some
measure $\nu$ on $S$, and (ii) for all $v \in V$ we have $\int_S f(s, v)\, \nu(ds) = 1$. The measure
$\nu$ is called the frequency measure, as it describes the relative frequency with which
events of certain types occur. The second assumption is just a normalization.

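Under these assumptions, we can find the distribution of the vector $(Z_v : v \in V_0)$,
where $V_0 = \{v_1, \ldots, v_d\} \subset V$. For $0 < z_v < \infty$, we have

$$\begin{aligned}
P[Z_v \le z_v,\ \forall v \in V_0]
&= P[r_i f(s_i, v) \le z_v,\ \forall i,\ \forall v \in V_0] \\
&= P\Big[r_i \max_{v \in V_0} \{z_v^{-1} f(s_i, v)\} \le 1,\ \forall i\Big] \\
&= \exp\left[-\int_S \int_0^{\infty} 1\Big(r \max_{v \in V_0} \{z_v^{-1} f(s, v)\} > 1\Big)\, r^{-2}\, dr\, \nu(ds)\right] \\
&= \exp\left[-\int_S \max_{v \in V_0} \{z_v^{-1} f(s, v)\}\, \nu(ds)\right]. \qquad (9.2)
\end{aligned}$$
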


Because of the normalization assumption, we have $P[Z_v \le z_v] = \exp(-1/z_v)$, that
is, the marginal distributions are standard-Fréchet. Moreover, the distribution of
the vector $(Z_v : v \in V_0)$ satisfies the max-stability relation (8.7). All in all, we
conclude that $(Z_v : v \in V_0)$ has a multivariate extreme value distribution with
standard-Fréchet margins.

Examples of this construction are the Gaussian model for spatial extremes of
rain storms in Smith (1991) and Coles and Tawn (1996a), and the directional model
for extreme wind speeds in Coles and Tawn (1994); see also section 9.2.2. The
extremal coefficients of the process $(Z_v : v \in V)$ in the sense of (8.55) are given
by $\theta_{V_0} = \int_S \max_{v \in V_0} f(s, v)\, \nu(ds)$.
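
When the classification space $S$ is finite, the integrals in (9.2) become sums and the construction can be checked numerically. The Python sketch below is ours: the two event types, the three sites, the raw profile values and all function names are hypothetical choices made only to illustrate the normalization, the Fréchet margins and the extremal coefficient.

```python
import math

def normalise_profile(f, nu):
    """Rescale each column v of f so that sum_s f[s][v] * nu[s] = 1 (assumption (ii))."""
    n_types, d = len(nu), len(f[0])
    totals = [sum(f[s][v] * nu[s] for s in range(n_types)) for v in range(d)]
    return [[f[s][v] / totals[v] for v in range(d)] for s in range(n_types)]

def exponent_function(f, nu, z):
    """Discrete version of the integral in (9.2): sum_s nu[s] * max_v f[s][v] / z[v]."""
    return sum(nu[s] * max(f[s][v] / z[v] for v in range(len(z))) for s in range(len(nu)))

def joint_cdf(f, nu, z):
    """P[Z_v <= z_v for all v] according to (9.2)."""
    return math.exp(-exponent_function(f, nu, z))

def extremal_coefficient(f, nu):
    """theta = sum_s nu[s] * max_v f[s][v]."""
    return exponent_function(f, nu, [1.0] * len(f[0]))

nu = [0.5, 1.5]                                   # frequency measure on two event types
raw = [[1.0, 2.0, 0.5], [3.0, 1.0, 1.0]]          # raw event profiles at three sites
f = normalise_profile(raw, nu)

z = [2.0, 0.7, 1.3]
print(joint_cdf(f, nu, z))
print(joint_cdf(f, nu, [2.0, math.inf, math.inf]), math.exp(-1 / 2.0))  # Frechet margin
print(extremal_coefficient(f, nu))
```

The extremal coefficient always lies between 1 (complete dependence) and the number of sites (independence).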


Spectral densities 


Recall from Chapter 8 that one of the representations of the dependence structure
of a multivariate extreme value distribution was in terms of a so-called spectral
measure $H$, which in the bivariate case is the positive measure on $[0, 1]$ given by
(8.28). Hence, assuming that $H$ is absolutely continuous, we may construct models
for $H$ by modelling its density $h$. However, we must take care that the constraint
(8.29) is fulfilled, that is, we need $\int_0^1 \omega h(\omega)\, d\omega = 1$ and $\int_0^1 (1 - \omega) h(\omega)\, d\omega = 1$.

Coles and Tawn (1991) describe a way to modify an arbitrary non-negative
function $h^*$ on $(0, 1)$ that does not satisfy the constraints to a function $h$ that
does. Let

$$m_1 = \int_0^1 u\, h^*(u)\, du, \qquad m_2 = \int_0^1 (1 - u)\, h^*(u)\, du. \qquad (9.3)$$


Without loss of generality, assume $m_1 > 0$ and $m_2 > 0$. From Pickands (1981), we
know that

$$P[X_1 > x_1, X_2 > x_2] = \exp\left\{-\int_0^1 [(u x_1) \vee \{(1 - u) x_2\}]\, h^*(u)\, du\right\}$$

is the joint survivor function of a bivariate min-stable distribution with exponential
margins with expectations $E[X_j] = 1/m_j$ for $j = 1, 2$. Hence, writing
$P[m_1 X_1 > v_1, m_2 X_2 > v_2] = \exp\{-l(v_1, v_2)\}$, we find that

$$l(v_1, v_2) = \int_0^1 \left[\frac{u v_1}{m_1} \vee \frac{(1 - u) v_2}{m_2}\right] h^*(u)\, du$$

is a stable tail dependence function, see also (8.51). Change variables
$u = m_1 \omega / \{m_1 \omega + m_2 (1 - \omega)\}$ to find
$l(v_1, v_2) = \int_0^1 [(\omega v_1) \vee \{(1 - \omega) v_2\}]\, h(\omega)\, d\omega$, where

$$h(\omega) = \frac{m_1 m_2}{\{m_1 \omega + m_2 (1 - \omega)\}^3}\, h^*\left(\frac{m_1 \omega}{m_1 \omega + m_2 (1 - \omega)}\right) \qquad (9.4)$$

for $0 < \omega < 1$. This $h$ then must be the density of a spectral measure $H$. One can
also verify directly that the constraints are satisfied.
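
The direct verification of the constraints can also be done numerically. The Python sketch below is ours: it applies the transformation (9.4) to an arbitrary unnormalized function $h^*$ (a hypothetical choice) and checks by a simple midpoint rule that the resulting $h$ satisfies both moment constraints. The helper names are assumptions, not part of any package.

```python
def integrate(g, a=0.0, b=1.0, n=20000):
    """Midpoint rule for the integral of g over [a, b]."""
    step = (b - a) / n
    return step * sum(g(a + (i + 0.5) * step) for i in range(n))

def coles_tawn_density(h_star):
    """Turn a non-negative function h_star on (0, 1) into a spectral density h via (9.3)-(9.4)."""
    m1 = integrate(lambda u: u * h_star(u))
    m2 = integrate(lambda u: (1 - u) * h_star(u))

    def h(w):
        denom = m1 * w + m2 * (1 - w)
        return m1 * m2 / denom ** 3 * h_star(m1 * w / denom)

    return h, m1, m2

# An arbitrary starting function that violates the constraints (illustrative choice).
def h_star(u):
    return u ** 2 * (1 - u)

h, m1, m2 = coles_tawn_density(h_star)
constraint_1 = integrate(lambda w: w * h(w))
constraint_2 = integrate(lambda w: (1 - w) * h(w))
print(m1, m2)                       # close to 1/20 and 1/30
print(constraint_1, constraint_2)   # both close to 1
```

Since both constraints hold, the total mass of $H$ is automatically equal to 2, the sum of the two integrals.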

The procedure can be extended to accommodate spectral measures $H$
with point masses at 0 or 1. More importantly, Coles and Tawn (1991) generalize
the argument to higher dimensions. For a non-negative function $h^*$ on the
unit simplex $S_d$ such that $m_j = \int_{S_d} u_j h^*(u)\, \lambda(du)$ is positive and finite [where
$\lambda(d\omega) = d\omega_1 \cdots d\omega_{d-1}$ denotes the $(d - 1)$-dimensional Lebesgue measure on $S_d$],

$$h(\omega) = (m_1 \omega_1 + \cdots + m_d \omega_d)^{-(d+1)}\, m_1 \cdots m_d\; h^*(u_1, \ldots, u_d), \qquad \text{where} \quad u_j = \frac{m_j \omega_j}{m_1 \omega_1 + \cdots + m_d \omega_d}, \qquad (9.5)$$

defines the density of a measure $H$ concentrated on the interior of $S_d$ and satisfying
(8.26).


Order restrictions 

Sometimes, the variables that we want to model satisfy certain order restrictions.
For instance, if $M_1$ and $M_2$ denote the maxima of respectively the hourly and
two-hourly aggregated rainfall amounts during a certain period at a certain location,
then necessarily $M_1 \le M_2 \le 2 M_1$. Nadarajah et al. (1998) propose ways to
construct models for bivariate extreme value distributions that can incorporate such
restrictions.


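For simplicity, we restrict attention to the case where the margins are standard-Fréchet.
Let $G_*$ be a bivariate extreme value distribution with standard-Fréchet
margins, and let the random pair $(Z_1, Z_2)$ have distribution function $G_*$. Let $H$ be
the spectral measure of $G_*$ as in (8.28), that is

$$P[Z_1 \le z_1, Z_2 \le z_2] = G_*(z_1, z_2) = \exp\left\{-\int_{[0,1]} \left(\frac{\omega}{z_1} \vee \frac{1 - \omega}{z_2}\right) H(d\omega)\right\}.$$

For $0 < m < \infty$, we have

$$\begin{aligned}
P[Z_2 \le m Z_1]
&= \lim_{\delta \downarrow 0} \sum_{k=0}^{\infty} P[k\delta < Z_1 \le (k + 1)\delta,\ Z_2 \le m k \delta] \\
&= \lim_{\delta \downarrow 0} \delta \sum_{k=0}^{\infty} \frac{G_*\{(k + 1)\delta, m k \delta\} - G_*(k\delta, m k \delta)}{\delta} \\
&= \int_0^{\infty} \lim_{\delta \downarrow 0} \frac{G_*(z + \delta, m z) - G_*(z, m z)}{\delta}\, dz \\
&= \int_0^{\infty} G_*(z, m z)\, z^{-2}\, dz \int_{(1/(m+1), 1]} \omega\, H(d\omega).
\end{aligned}$$

Hence $P[Z_2 > m Z_1] = 1$ provided $H((1/(m + 1), 1]) = 0$, that is, the spectral
measure is concentrated on $[0, 1/(m + 1)]$. Similarly, $P[Z_1 > m Z_2] = 1$ provided
$H([0, m/(m + 1))) = 0$, that is, $H$ is concentrated on $[m/(m + 1), 1]$. The requirements
(8.29) force $m \le 1$.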

All in all, we can implement order restrictions on $Z_1$ and $Z_2$ by letting the
spectral measure $H$ be concentrated on a subinterval $[a, b]$ of $[0, 1]$, where
$0 \le a \le 1/2 \le b \le 1$. Observe that if $a = 1/2$ or $b = 1/2$, then in view of (8.29), $H$
must be concentrated on $1/2$, corresponding to complete dependence. So assume
$a < 1/2 < b$.

Nadarajah et al. (1998) describe a method to construct such $H$ starting from
an initial measure $H^*$ with density $h^*$ and satisfying (8.29). For

$$0 \le \gamma_a \le \frac{2b - 1}{b - a}, \qquad 0 \le \gamma_b \le \frac{1 - 2a}{b - a},$$

define $H$ by its point masses on $a$ and $b$ and its density $h$ on $(a, b)$ through
$H(\{a\}) = \gamma_a$, $H(\{b\}) = \gamma_b$, and, for $a < \omega < b$,

$$h(\omega) = \frac{(b - a)(\alpha\beta)^2}{\{\alpha(\omega - a) + \beta(b - \omega)\}^3}\, h^*\left(\frac{\alpha(\omega - a)}{\alpha(\omega - a) + \beta(b - \omega)}\right),$$

where $\alpha = 2b - 1 - (b - a)\gamma_a$ and $\beta = 1 - 2a - (b - a)\gamma_b$. Then $H$ satisfies
(8.29) and is concentrated on $[a, b]$, as desired. If $\gamma_a$ and $\gamma_b$ are equal to their
respective upper boundaries, then $h = 0$, so that $H$ merely consists of atoms at
$\{a\}$ and $\{b\}$, which is the so-called natural model, already introduced by Tiago de
Oliveira (1980, 1989b).
9.2.2 Some parametric models

Logistic model and variations 

Basic form. In its simplest form, the logistic model has stable tail dependence
function

$$l(v_1, v_2) = (v_1^{1/\alpha} + v_2^{1/\alpha})^{\alpha}, \qquad v_j \ge 0, \qquad (9.6)$$

with parameter $0 < \alpha \le 1$. Introduced by Gumbel (1960a,b), it is the oldest parametric
family of bivariate extreme value dependence structures. Because of its
simplicity, it is still the most popular one. From (8.35), we can compute easily
that the corresponding spectral measure $H$ does not have point masses on 0 or 1,
while by (8.36), its spectral density $h$ on $(0, 1)$ is

$$h(\omega) = \frac{1 - \alpha}{\alpha}\, \{\omega(1 - \omega)\}^{1/\alpha - 2}\, \{(1 - \omega)^{1/\alpha} + \omega^{1/\alpha}\}^{\alpha - 2}.$$

The parameter $\alpha$ measures the strength of dependence between the two coordinates.
In particular, independence and complete dependence correspond to $\alpha = 1$
and $\alpha \downarrow 0$, respectively. An interesting interpretation of the parameter $\alpha$ is given
in Ledford and Tawn (1998): they show that in a random sample from a bivariate
extreme value distribution with logistic dependence structure, the probability that
the maximum values in both coordinates occur at the same pair of observations
converges to $1 - \alpha$ as the sample size tends to infinity. Further, Kendall’s tau
(8.57) is given by $\tau = 1 - \alpha$ (Oakes and Manatunga 1992), whereas the correlation
between the two coordinates after transformation to Gumbel or exponential
margins as in (8.58) or (8.59) is equal to $1 - \alpha^2$ or $\alpha \Gamma^2(\alpha) \{\Gamma(2\alpha)\}^{-1} - 1$ respectively
(Tawn 1988a; Tiago de Oliveira 1980). The extremal coefficient in (8.56) is
$l(1, 1) = 2^{\alpha}$. All in all, the strength of dependence increases as $\alpha$ decreases.
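
Some of the stated properties of the logistic model are easily confirmed numerically. The Python sketch below is ours: it evaluates the stable tail dependence function and the spectral density in the form $h(\omega) = \{(1 - \alpha)/\alpha\}\{\omega(1 - \omega)\}^{1/\alpha - 2}\{(1 - \omega)^{1/\alpha} + \omega^{1/\alpha}\}^{\alpha - 2}$, and checks the extremal coefficient and the moment constraints by a midpoint rule. The function names and the value $\alpha = 0.4$ are illustrative assumptions.

```python
def logistic_l(v1, v2, alpha):
    """Stable tail dependence function (9.6) of the logistic model."""
    return (v1 ** (1 / alpha) + v2 ** (1 / alpha)) ** alpha

def logistic_h(w, alpha):
    """Spectral density of the logistic model on (0, 1), for 0 < alpha < 1."""
    r = 1 / alpha
    return (r - 1) * (w * (1 - w)) ** (r - 2) * ((1 - w) ** r + w ** r) ** (alpha - 2)

def integrate(g, n=20000):
    """Midpoint rule for the integral of g over (0, 1)."""
    step = 1.0 / n
    return step * sum(g((i + 0.5) * step) for i in range(n))

alpha = 0.4
extremal = logistic_l(1.0, 1.0, alpha)
total_mass = integrate(lambda w: logistic_h(w, alpha))
first_moment = integrate(lambda w: w * logistic_h(w, alpha))

print(extremal, 2 ** alpha)          # extremal coefficient l(1, 1) = 2^alpha
print(total_mass, first_moment)      # close to 2 and 1: no point masses at 0 or 1
```

For $\alpha > 1/2$ the density is unbounded near 0 and 1, so a plain midpoint rule converges slowly there; the value 0.4 was chosen to avoid this.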

Asymmetric logistic model. The logistic model has the disadvantage that it is
symmetric in the two variables. An asymmetric extension, proposed by Tawn
(1988a), is

$$l(v_1, v_2) = (1 - \psi_1) v_1 + (1 - \psi_2) v_2 + \{(\psi_1 v_1)^{1/\alpha} + (\psi_2 v_2)^{1/\alpha}\}^{\alpha} \qquad (9.7)$$

for $v_j \ge 0$, with parameters $0 < \alpha \le 1$ and $0 \le \psi_j \le 1$ for $j = 1, 2$. For $\psi_1 = \psi_2$,
we obtain a mixture of independence and the logistic model; in particular, for
$\psi_1 = \psi_2 = 1$, the model reduces to the logistic model (9.6). Independence arises as
soon as $\alpha = 1$ or $\psi_1 = 0$ or $\psi_2 = 0$. If $\alpha < 1$, the corresponding spectral measure
$H$ has point masses $H(\{0\}) = 1 - \psi_2$ and $H(\{1\}) = 1 - \psi_1$, while the spectral
density $h$ is given by (8.37). Figure 9.2(a) shows the Pickands dependence function
$A(t) = l(1 - t, t)$ for a number of parameter values.
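
The special cases and the point masses of the asymmetric logistic model can be verified with a few lines of code. The Python sketch below is ours: it recovers the point masses from one-sided difference quotients of $A$, using the relation $H([0, \omega]) = 1 + A'(\omega)$ for $\omega < 1$ and $H([0, 1]) = 2$ from section 8.6.2. The parameter values and function names are illustrative assumptions.

```python
def asym_logistic_l(v1, v2, alpha, psi1, psi2):
    """Stable tail dependence function (9.7) of the asymmetric logistic model."""
    return ((1 - psi1) * v1 + (1 - psi2) * v2
            + ((psi1 * v1) ** (1 / alpha) + (psi2 * v2) ** (1 / alpha)) ** alpha)

def pickands_A(t, alpha, psi1, psi2):
    """Pickands dependence function A(t) = l(1 - t, t)."""
    return asym_logistic_l(1 - t, t, alpha, psi1, psi2)

alpha, psi1, psi2 = 0.5, 0.6, 0.8
eps = 1e-6

# Point masses from one-sided derivatives of A at the end-points.
slope_at_0 = (pickands_A(eps, alpha, psi1, psi2) - pickands_A(0.0, alpha, psi1, psi2)) / eps
slope_at_1 = (pickands_A(1.0, alpha, psi1, psi2) - pickands_A(1 - eps, alpha, psi1, psi2)) / eps
mass_at_0 = 1 + slope_at_0           # H({0}) = 1 + A'(0+)
mass_at_1 = 1 - slope_at_1           # H({1}) = 2 - {1 + A'(1-)}

print(mass_at_0, 1 - psi2)
print(mass_at_1, 1 - psi1)
```

The same function `pickands_A` can be evaluated on a grid of `t` values to draw curves of the kind shown in Figure 9.2(a).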

For $\alpha \downarrow 0$, we get the non-differentiable model

$$l(v_1, v_2) = \max\{(1 - \psi_1) v_1 + v_2,\ v_1 + (1 - \psi_2) v_2\}. \qquad (9.8)$$

Its spectral measure $H$ is concentrated on three points: $H(\{0\}) = 1 - \psi_2$,
$H(\{\psi_1/(\psi_1 + \psi_2)\}) = \psi_1 + \psi_2$, and $H(\{1\}) = 1 - \psi_1$. If $\psi_1 = \psi_2$ in (9.8), we get a
bivariate model discovered by Marshall and Olkin (1967) in the context of survival
analysis and recognized as an extreme value dependence structure by Tiago
de Oliveira (1971), who called it the Gumbel model. Also, choosing $\psi_1 = 1$ or
$\psi_2 = 1$ in (9.8) yields the so-called bi-extremal model (Tiago de Oliveira 1969,
1974). Complete dependence arises if $\psi_1 = \psi_2 = 1$.

Bilogistic model. In the asymmetric logistic model, the spectral measure $H$ of
(8.30) may put positive mass on the boundary points 0 and 1, which complicates
likelihood inference in certain point-process models for high-threshold
exceedances, see section 9.4.2. Therefore, starting from the representation (8.39)
of a bivariate extreme value dependence structure in terms of spectral functions,
Joe et al. (1992) propose the model

$$l(v_1, v_2) = \int_0^1 \max\{(1 - \alpha) t^{-\alpha} v_1,\ (1 - \beta)(1 - t)^{-\beta} v_2\}\, dt, \qquad (9.9)$$

where $0 < \alpha < 1$ and $0 < \beta < 1$. The model is another asymmetric extension of
the logistic model (9.6), to which it reduces if $\alpha = \beta$. The parameter $(\alpha + \beta)/2$
may be thought of as measuring the strength of dependence, while $\alpha - \beta$ measures
the amount of asymmetry. From (8.35) we find that the spectral measure $H$ does
not put any mass on 0 or 1, whereas (8.36) only leads to an implicit formula for
its density $h$ on $(0, 1)$ in terms of the root of a certain equation.

Tajvidi's generalized symmetric logistic model. Tajvidi (1996) proposes the
following extension of the bivariate symmetric logistic model (9.6): for $v_j \ge 0$,

$$l(v_1, v_2) = \{v_1^{2/\alpha} + 2(1 + \psi) v_1^{1/\alpha} v_2^{1/\alpha} + v_2^{2/\alpha}\}^{\alpha/2},$$

where $0 < \alpha \le 1$ and $-1 < \psi \le 2(\alpha^{-1} - 1)$. The model seems to have an
identifiability problem as it reduces to (9.6) with shape parameter $\alpha$ for $\psi = 0$
and to (9.6) with shape parameter $\alpha/2$ for $\psi \downarrow -1$. Complete dependence arises
as soon as $\alpha \downarrow 0$, while independence occurs for $\alpha = 1$ and $\psi = 0$.

Multivariate extensions. With the aim of constructing spatial models for
environmental extremes, Tawn (1990) proposes the following generalization of the
asymmetric logistic model (9.7) to an arbitrary number, $d$, of dimensions. Let $C_d$
be the collection of non-empty subsets $c$ of $\{1, \ldots, d\}$. The multivariate
asymmetric logistic model is defined by

$$l(v) = \sum_{c \in C_d} \Bigl\{ \sum_{j \in c} (\psi_{c,j} v_j)^{1/\alpha_c} \Bigr\}^{\alpha_c} \tag{9.10}$$

for $v \in [0, \infty)^d$; here $0 < \alpha_c \le 1$, $\psi_{c,j} \ge 0$, and $\sum_{c \ni j} \psi_{c,j} = 1$ for
$j = 1, \ldots, d$. If $\alpha_c \downarrow 0$ for all $c \in C_d$, we get a model originally due to
Marshall and Olkin (1967), while the model arising when the $\psi_{c,j}$ do not depend
on $j$ is already studied by McFadden (1978). Smith et al. (1990) apply a certain
tri-variate sub-model to sea-level data on the coast of England.

The spectral densities corresponding to (9.10) can be computed from (8.34) 
and are given explicitly in Coles and Tawn (1991). Simulation methods for the 
multivariate logistic model are developed in Stephenson (2003) and the references 
therein. 

A simple special case of (9.10) is

$$l(v) = (v_1^{1/\alpha} + \cdots + v_d^{1/\alpha})^{\alpha}, \tag{9.11}$$

the multivariate symmetric logistic distribution. Genest and Rivest (1989)
characterize the copulas corresponding to (9.11) as the only extreme value copulas
that are also Archimedean copulas.

Tawn (1990) also mentions an extension of (9.10) in a kind of nested structure
involving a hierarchy of levels, thereby generalizing models studied in McFadden
(1978). A special tri-variate case applied in Coles and Tawn (1991) to
oceanographic data is the nested logistic model,

$$l(v_1, v_2, v_3) = \{(v_1^{1/\alpha} + v_2^{1/\alpha})^{\alpha/\beta} + v_3^{1/\beta}\}^{\beta}, \tag{9.12}$$

where $0 < \alpha \le \beta \le 1$, featuring bivariate symmetric logistic dependence with
parameter $\alpha$ for the first two coordinates and with parameter $\beta$ for the other
pairs of coordinates. Proceeding in a recursive manner from (9.12) to higher
dimensions leads to a model described by Joe (1994).

Yet another multivariate extension of the logistic model is the time series
logistic model by Coles and Tawn (1991). The idea is to start from a first-order
Markov process $X_1, X_2, \ldots$, for which the bivariate dependence structures of the
pairs $(X_j, X_{j+1})$ fall in the (differentiable) domain of attraction of the bivariate
symmetric logistic model (9.6). Then, by Markov dependence, the joint dependence
structure of the vector $(X_1, \ldots, X_d)$ lies in the domain of attraction of a
$d$-variate dependence structure, coined the time series logistic model. In section
10.4, this is implicitly used to model the extremes of time series with Markov
structure.

Negative logistic models and extensions. The negative logistic model introduced
by Joe (1990) is quite similar in form to the logistic model. In its asymmetric
version, the bivariate negative logistic model is defined by

$$l(v_1, v_2) = v_1 + v_2 - \{(\psi_1 v_1)^{1/\alpha} + (\psi_2 v_2)^{1/\alpha}\}^{\alpha}, \tag{9.13}$$

where $-\infty < \alpha < 0$ and $0 \le \psi_j \le 1$ for $j = 1, 2$. Independence arises as soon
as $\alpha \downarrow -\infty$ or $\psi_1 = 0$ or $\psi_2 = 0$. If $\alpha \uparrow 0$, we rediscover the
non-differentiable model (9.8). The model is symmetric for $\psi_1 = \psi_2$. Figure 9.2(b)
shows the Pickands dependence function $A(t) = l(1 - t, t)$ for $\psi_1 = \psi_2 = 1$ and a
number of values for $\alpha$.

We can, as usual, compute the spectral measure $H$ from (8.35) and (8.36): we
find $H(\{0\}) = 1 - \psi_2$, $H(\{1\}) = 1 - \psi_1$, and spectral density

$$h(\omega) = (1 - \alpha^{-1}) (\psi_1 \psi_2)^{1/\alpha} \{\omega (1 - \omega)\}^{1/\alpha - 2} [\{\psi_1 (1 - \omega)\}^{1/\alpha} + (\psi_2 \omega)^{1/\alpha}]^{\alpha - 2}$$

for $0 < \omega < 1$.

In the same way as the bilogistic model is an asymmetric extension of the
bivariate symmetric logistic model, the negative bilogistic model by Coles and
Tawn (1994) is an extension of the bivariate symmetric negative logistic model.
The stable tail dependence function is again (9.9), but now with parameter ranges
$-\infty < \alpha < 0$ and $-\infty < \beta < 0$. The spectral measure $H$ does not have point
masses on 0 or 1, although its density $h$ on $(0, 1)$ can only be expressed in terms
of the root of a certain equation. A little reflection shows that one could even
consider (9.9) with $0 < \alpha < 1$ and $-\infty < \beta < 0$ or vice versa, thereby obtaining
some kind of hybrid between the bilogistic and negative bilogistic model.

The general multivariate version of (9.13) is

$$l(v) = \sum_{j=1}^{d} v_j - \sum_{c \in C_d : |c| \ge 2} (-1)^{|c|} \Bigl\{ \sum_{j \in c} (\psi_{c,j} v_j)^{1/\alpha_c} \Bigr\}^{\alpha_c},$$

where $C_d$ is the collection of non-empty subsets of $\{1, \ldots, d\}$; the parameter
ranges are $-\infty < \alpha_c < 0$, $\psi_{c,j} \ge 0$, and
$\sum_{c \ni j, |c| \ge 2} (-1)^{|c|} \psi_{c,j} \le 1$ for all $j = 1, \ldots, d$.
Also for this model, formula (8.34) can be used to find the spectral densities
of the corresponding spectral measure $H$, see Coles and Tawn (1991). A related
multivariate extension is proposed in Joe (1994); see also Kotz and Nadarajah
(2000), p. 130.

Polynomial Pickands dependence function

Klüppelberg and May (1999) describe the class of Pickands dependence functions
$A$ that have a polynomial form,

$$A(t) = \psi_0 + \psi_1 t + \psi_2 t^2 + \cdots + \psi_m t^m, \qquad 0 \le t \le 1, \tag{9.14}$$

with $m$ a positive integer. The conditions $A(0) = 1$, $A(1) = 1$, $0 \ge A'(0) \ge -1$,
$0 \le A'(1) \le 1$, $A''(0) \ge 0$ and $A''(1) \ge 0$ imply the necessary restrictions

$$\begin{aligned}
\psi_0 &= 1 \\
\psi_1 &= -(\psi_2 + \cdots + \psi_m) \\
0 &\le \psi_2 + \cdots + \psi_m \le 1 \\
\psi_2 &\ge 0 \\
0 &\le \psi_2 + 2\psi_3 + \cdots + (m - 1)\psi_m \le 1 \\
0 &\le \psi_2 + 3\psi_3 + \cdots + \tbinom{m}{2} \psi_m
\end{aligned} \tag{9.15}$$

which, however, are not sufficient, in general, to guarantee that $A(t)$ in (9.14) is a
Pickands dependence function [for instance, the function $A(t) = 1 - t^3 + t^4$ does
satisfy (9.15) but is not convex].

The spectral measure $H$ can be easily computed from $A$ through (8.47). In
particular, $H(\{0\}) = 1 - (\psi_2 + \cdots + \psi_m)$ and
$H(\{1\}) = 1 - \{\psi_2 + 2\psi_3 + \cdots + (m - 1)\psi_m\}$, while its density on $(0, 1)$ is
$h = A''$. Since $A$ is a polynomial, complete dependence, $A(t) = \max(t, 1 - t)$, can
only be attained as $m \to \infty$. The linear case, $m = 1$, admits as only solution
$A(t) = 1$, corresponding to independence. Most relevant for statistical purposes
are the quadratic and the cubic case, corresponding to the mixed and asymmetric
mixed model, respectively.
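
The counterexample $A(t) = 1 - t^3 + t^4$ mentioned for the restrictions (9.15) can be verified directly. The sketch below assumes that (9.15) consists of the six conditions $\psi_0 = 1$, $\psi_1 = -(\psi_2 + \cdots + \psi_m)$, $0 \le \psi_2 + \cdots + \psi_m \le 1$, $\psi_2 \ge 0$, $0 \le \psi_2 + 2\psi_3 + \cdots + (m-1)\psi_m \le 1$ and $\psi_2 + 3\psi_3 + \cdots + \binom{m}{2}\psi_m \ge 0$; the helper names are hypothetical.

```python
from math import comb


def satisfies_9_15(psi, tol=1e-12):
    """Check the necessary restrictions on psi = [psi_0, ..., psi_m], m >= 2."""
    m = len(psi) - 1
    s1 = sum(psi[2:])                                        # psi_2 + ... + psi_m
    s2 = sum((j - 1) * psi[j] for j in range(2, m + 1))      # psi_2 + 2 psi_3 + ...
    s3 = sum(comb(j, 2) * psi[j] for j in range(2, m + 1))   # psi_2 + 3 psi_3 + ...
    return (abs(psi[0] - 1.0) <= tol
            and abs(psi[1] + s1) <= tol
            and -tol <= s1 <= 1.0 + tol
            and psi[2] >= -tol
            and -tol <= s2 <= 1.0 + tol
            and s3 >= -tol)


def second_derivative(psi, t):
    """A''(t) for the polynomial A(t) = sum_j psi_j t**j."""
    return sum(j * (j - 1) * psi[j] * t ** (j - 2) for j in range(2, len(psi)))


counterexample = [1.0, 0.0, 0.0, -1.0, 1.0]   # A(t) = 1 - t^3 + t^4
mixed = [1.0, -0.5, 0.5]                      # A(t) = 1 - t/2 + t^2/2

print(satisfies_9_15(counterexample), second_derivative(counterexample, 0.25))
print(satisfies_9_15(mixed), second_derivative(mixed, 0.25))
```

For the counterexample the restrictions hold, yet $A''(1/4) = -6/4 + 12/16 < 0$, so the polynomial is not convex; for the quadratic example the second derivative is constant and non-negative.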

Quadratic case: mixed model. If $m = 2$ in (9.14), then we must have
$-\psi_1 = \psi_2 =: \psi \in [0, 1]$, leading to the (symmetric) mixed model

$$A(t) = 1 - \psi t + \psi t^2, \qquad 0 \le t \le 1, \tag{9.16}$$

appearing already in Gumbel (1962). Observe that this model also arises as a
special case of the negative logistic: in (9.13), take $\alpha = -1$ and $\psi_1 = \psi_2 = \psi$.
Independence arises for $\psi = 0$, but complete dependence is not possible in this
model. For a random pair with dependence structure (9.16), the correlation
coefficient is $6\pi^{-2} \{\arccos(1 - \psi/2)\}^2 \in [0, 2/3]$ if the margins are transformed
to the Gumbel distribution as in (8.58) (Tiago de Oliveira 1980), and an even more
complicated expression if the margins are transformed to the exponential
distribution as in (8.59) (Tawn 1988a; Tiago de Oliveira 1989b).

Cubic case: asymmetric mixed model. If $m = 3$ in (9.14), the Pickands
dependence function takes the form

$$A(t) = 1 - (\psi_2 + \psi_3) t + \psi_2 t^2 + \psi_3 t^3, \qquad 0 \le t \le 1, \tag{9.17}$$

see Figure 9.3(a). The conditions (9.15) reduce to

$$\psi_2 \ge 0, \qquad \psi_2 + 3\psi_3 \ge 0, \qquad \psi_2 + \psi_3 \le 1, \qquad \psi_2 + 2\psi_3 \le 1,$$

which are also sufficient to guarantee that $A(t)$ in (9.17) is a Pickands dependence
function. Independence occurs at $\psi_2 = \psi_3 = 0$, a corner of the parameter space,
while, again, complete dependence is not possible.

Gaussian model

The Gaussian model is defined by its stable tail dependence function

$$l(v_1, v_2) = v_1 \Phi\{\lambda + (2\lambda)^{-1} \log(v_1/v_2)\} + v_2 \Phi\{\lambda + (2\lambda)^{-1} \log(v_2/v_1)\} \tag{9.18}$$

with parameter $\lambda \in [0, \infty]$, and with $\Phi$ the standard normal distribution function,
see Figure 9.3(b). The cases $\lambda = 0$ and $\lambda = \infty$ correspond to complete dependence
and independence, respectively. The extremal coefficient in (8.56) is
$l(1, 1) = 2\Phi(\lambda)$, that is, dependence decreases as $\lambda$ increases. By (8.35), the
spectral measure $H$ does not put mass on 0 or 1, while its density on $(0, 1)$ can be
easily computed from (8.36).

Hüsler and Reiss (1989) characterized the model as the limit dependence
structure of the suitably normalized component-wise maximum in a triangular array
$X_{1n}, \ldots, X_{nn}$ of independent, centred, unit-variance bivariate normal random
pairs with correlation $\rho_n$ such that $(1 - \rho_n) \log(n) \to \lambda^2$ as $n \to \infty$.
Hooghiemstra and Hüsler (1996) prove a related characterization in terms of
projections of standard normal pairs in directions in the neighbourhood of a given
fixed direction. Coles and Pauli (2001) find comparable results for a class of
bivariate Poisson distributions, the heuristic being that a Poisson distribution with
a large intensity can be approximated well by a normal distribution.

The model can also be obtained by the method of max-stable processes as in
(9.2), incidentally yielding a generalization to higher dimensions. Let both the index
set $V$ and classification space $S$ be $\mathbb{R}$, the frequency measure $\nu$ be the Lebesgue
measure, and the event profile function $f(s, v)$ be the probability density function
in $s$ of the normal distribution with mean $v$ and variance $\sigma^2$. Then the stable tail
dependence function of the pair $(Z_v, Z_{v'})$ is equal to (9.18) with
$\lambda = |v - v'|/(2\sigma)$. A further extension of the model to $V = \mathbb{R}^2$ is used in Smith
(1991) to describe spatial dependence between storms as a function of the distance
between the storm locations; see also Coles and Tawn (1996a) and Schlather and
Tawn (2003). An alternative multivariate extension is proposed in Joe (1994).
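
The behaviour of the Gaussian model between its two limiting cases can be illustrated numerically. The following sketch assumes the form $l(v_1, v_2) = v_1 \Phi\{\lambda + (2\lambda)^{-1} \log(v_1/v_2)\} + v_2 \Phi\{\lambda + (2\lambda)^{-1} \log(v_2/v_1)\}$ for (9.18) with $0 < \lambda < \infty$; the function names are hypothetical.

```python
from math import erf, log, sqrt


def Phi(x):
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def gaussian_l(v1, v2, lam):
    """Stable tail dependence function of the Gaussian model.

    Assumes v1, v2 > 0 and 0 < lam < infinity; the limits lam = 0 and
    lam = infinity are not handled here.
    """
    return (v1 * Phi(lam + log(v1 / v2) / (2.0 * lam))
            + v2 * Phi(lam + log(v2 / v1) / (2.0 * lam)))


extremal_coefficient = gaussian_l(1.0, 1.0, 0.7)    # equals 2 * Phi(0.7)
near_dependence = gaussian_l(2.0, 3.0, 1e-3)        # close to max(2, 3)
near_independence = gaussian_l(2.0, 3.0, 8.0)       # close to 2 + 3
```

Small values of $\lambda$ bring $l(v_1, v_2)$ close to $\max(v_1, v_2)$ and large values bring it close to $v_1 + v_2$, in line with the two limiting cases described in the text.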


Circular model

Using the technique of max-stable processes in (9.2), Coles and Walshaw (1994)
construct a model for describing dependence between annual maxima of wind
speeds recorded at a fixed location across continuous directional space. A typical
storm has a principal direction $s \in S = (0, 2\pi]$ and strength $0 < r < \infty$. Its
relative strength at direction $v \in V = (0, 2\pi]$ is $r f_\zeta(s, v)$, where

$$f_\zeta(s, v) = \frac{1}{2\pi I_0(\zeta)} \exp\{\zeta \cos(s - v)\},$$

with $0 \le \zeta < \infty$ and $I_0(\zeta) = (2\pi)^{-1} \int_0^{2\pi} \exp\{\zeta \cos(s)\}\, ds$ the modified
Bessel function of order 0. The function $f_\zeta(\cdot, v)$ is the von Mises circular
density with location and concentration parameters $v$ and $\zeta$, respectively.

By (9.2), the joint distribution of the maximal wind speeds $(Z_v : v \in V_0)$
recorded in a given year at a collection $V_0 \subset (0, 2\pi]$ of directions and transformed
to standard-Fréchet margins is then given by

$$P[Z_v \le z_v, \forall v \in V_0] = \exp\Bigl\{ -\int_0^{2\pi} \max_{v \in V_0} \frac{f_\zeta(s, v)}{z_v}\, ds \Bigr\}$$

for $0 < z_v < \infty$. Large values of $\zeta$ correspond to profiles that are highly
concentrated around a single direction, the limit $\zeta \to \infty$ being that of
independent $Z_v$.
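On the other hand, $\zeta = 0$ gives a constant profile and complete dependence.

Dirichlet model

The following model is an example of the construction of (9.4). For positive
numbers $\alpha_1$ and $\alpha_2$, let

$$h^*(u) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} u^{\alpha_1 - 1} (1 - u)^{\alpha_2 - 1}, \qquad 0 < u < 1;$$

observe that this is the probability density function of a Beta distribution. We have
$m_j = \alpha_j/(\alpha_1 + \alpha_2)$ for $j = 1, 2$ in (9.3), so that by (9.4), we obtain after some
calculation,

$$h(\omega) = \frac{\alpha_1^{\alpha_1} \alpha_2^{\alpha_2} \Gamma(\alpha_1 + \alpha_2 + 1)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \frac{\omega^{\alpha_1 - 1} (1 - \omega)^{\alpha_2 - 1}}{\{\alpha_1 \omega + \alpha_2 (1 - \omega)\}^{\alpha_1 + \alpha_2 + 1}}$$

for $0 < \omega < 1$, the density of a measure $H$ on $(0, 1)$ satisfying (8.29).

In $d$ dimensions, we start from the Dirichlet density

$$h^*(u) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_d)} u_1^{\alpha_1 - 1} \cdots u_d^{\alpha_d - 1}$$

on $u \in S_d$, with parameters $\alpha_j > 0$ for $j = 1, \ldots, d$. As its $j$th moment is
$m_j = \alpha_j/(\alpha_1 + \cdots + \alpha_d)$, we obtain from (9.5) the spectral density

$$h(\omega) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d + 1)}{\alpha_1 \omega_1 + \cdots + \alpha_d \omega_d} \prod_{j=1}^{d} \frac{\alpha_j^{\alpha_j} \omega_j^{\alpha_j - 1}}{\Gamma(\alpha_j) (\alpha_1 \omega_1 + \cdots + \alpha_d \omega_d)^{\alpha_j}}$$

for $\omega \in S_d$, which is called the Dirichlet model in Coles and Tawn (1991).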
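
That the bivariate Dirichlet spectral density meets the moment constraints can be checked by numerical integration. This is a minimal sketch under two assumptions: that the bivariate density reads $h(\omega) = \alpha_1^{\alpha_1} \alpha_2^{\alpha_2} \Gamma(\alpha_1 + \alpha_2 + 1) \{\Gamma(\alpha_1)\Gamma(\alpha_2)\}^{-1} \omega^{\alpha_1 - 1} (1 - \omega)^{\alpha_2 - 1} \{\alpha_1 \omega + \alpha_2 (1 - \omega)\}^{-(\alpha_1 + \alpha_2 + 1)}$, and that requirement (8.29) amounts to $\int \omega\, dH = \int (1 - \omega)\, dH = 1$. The midpoint rule and all function names are illustrative choices.

```python
from math import gamma


def dirichlet_h(w, a1, a2):
    """Bivariate Dirichlet spectral density on (0, 1)."""
    const = a1 ** a1 * a2 ** a2 * gamma(a1 + a2 + 1.0) / (gamma(a1) * gamma(a2))
    return (const * w ** (a1 - 1.0) * (1.0 - w) ** (a2 - 1.0)
            / (a1 * w + a2 * (1.0 - w)) ** (a1 + a2 + 1.0))


def midpoint_integral(f, n=20000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    step = 1.0 / n
    return step * sum(f((i + 0.5) * step) for i in range(n))


a1, a2 = 2.0, 1.0
first_moment = midpoint_integral(lambda w: w * dirichlet_h(w, a1, a2))
second_moment = midpoint_integral(lambda w: (1.0 - w) * dirichlet_h(w, a1, a2))
total_mass = midpoint_integral(lambda w: dirichlet_h(w, a1, a2))
```

For $\alpha_1 = 2$ and $\alpha_2 = 1$ the density simplifies to $24\omega/(1 + \omega)^4$, for which both moments equal 1 and the total mass equals 2; parameter values below 1 produce singularities at the endpoints, where the midpoint rule used here becomes inaccurate.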


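Piecewise algebraic spectral density

For the bivariate case, Nadarajah (1999) proposes a spectral measure $H$ in (8.28)
with point masses at 0, $\theta$, and 1, where $0 < \theta < 1$, and with spectral density $h$
on $(0, \theta)$ and $(\theta, 1)$:

$$H(\{x\}) = \gamma_x, \qquad x \in \{0, \theta, 1\},$$

$$h(\omega) = \begin{cases} \alpha \omega^{r} & \text{if } 0 < \omega < \theta, \\ \beta (1 - \omega)^{s} & \text{if } \theta < \omega < 1. \end{cases}$$

The parameter ranges are $\alpha > 0$, $\beta > 0$, $r > -1$, $s > -1$, and $\gamma_x \ge 0$ for
$x \in \{0, \theta, 1\}$. The condition $\alpha \theta^{r} = \beta (1 - \theta)^{s}$ ensures that $h$ can be
continuously extended in $\theta$, whereas the requirement (8.29) is met as soon as
$\gamma_\theta \le 1/\{\theta \vee (1 - \theta)\}$ and

$$\gamma_0 = 1 - (1 - \theta) \gamma_\theta - \frac{\beta}{s + 2} (1 - \theta)^{s + 2} + \alpha \Bigl( \frac{\theta^{r + 2}}{r + 2} - \frac{\theta^{r + 1}}{r + 1} \Bigr),$$

$$\gamma_1 = 1 - \theta \gamma_\theta - \frac{\alpha}{r + 2} \theta^{r + 2} + \beta \Bigl\{ \frac{(1 - \theta)^{s + 2}}{s + 2} - \frac{(1 - \theta)^{s + 1}}{s + 1} \Bigr\}.$$

In total, the number of free model parameters is five. The model can accommodate
a wide range of characteristics of bivariate extremal dependence structures.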
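
The way the two boundary masses follow from the moment constraints can be made concrete. The sketch below assumes that requirement (8.29) reads $\int \omega\, dH = \int (1 - \omega)\, dH = 1$ and solves these two equations for $\gamma_0$ and $\gamma_1$, with $\beta$ fixed by the continuity condition $\alpha \theta^r = \beta (1 - \theta)^s$; the function names and the parameter values are hypothetical.

```python
def nadarajah_masses(alpha, r, s, theta, gamma_theta):
    """Return (beta, gamma_0, gamma_1) implied by continuity and the moment constraints."""
    beta = alpha * theta ** r / (1.0 - theta) ** s            # continuity of h at theta
    g0 = (1.0 - (1.0 - theta) * gamma_theta
          - beta * (1.0 - theta) ** (s + 2) / (s + 2)
          + alpha * (theta ** (r + 2) / (r + 2) - theta ** (r + 1) / (r + 1)))
    g1 = (1.0 - theta * gamma_theta
          - alpha * theta ** (r + 2) / (r + 2)
          + beta * ((1.0 - theta) ** (s + 2) / (s + 2)
                    - (1.0 - theta) ** (s + 1) / (s + 1)))
    return beta, g0, g1


def moment(weight, alpha, beta, r, s, theta, masses, n=20000):
    """Integral of weight(w) dH(w): point masses plus midpoint rule on both pieces."""
    g0, gth, g1 = masses
    total = g0 * weight(0.0) + gth * weight(theta) + g1 * weight(1.0)
    h1 = theta / n
    total += h1 * sum(weight((i + 0.5) * h1) * alpha * ((i + 0.5) * h1) ** r
                      for i in range(n))
    h2 = (1.0 - theta) / n
    total += h2 * sum(weight(theta + (i + 0.5) * h2)
                      * beta * (1.0 - theta - (i + 0.5) * h2) ** s
                      for i in range(n))
    return total


alpha, r, s, theta, gamma_theta = 0.5, 1.0, 2.0, 0.4, 0.3
beta, g0, g1 = nadarajah_masses(alpha, r, s, theta, gamma_theta)
masses = (g0, gamma_theta, g1)
m1 = moment(lambda w: w, alpha, beta, r, s, theta, masses)
m2 = moment(lambda w: 1.0 - w, alpha, beta, r, s, theta, masses)
```

The five inputs of `nadarajah_masses` are the five free parameters; $\beta$, $\gamma_0$ and $\gamma_1$ are derived quantities, and a parameter choice is admissible only if the resulting $\gamma_0$ and $\gamma_1$ are non-negative.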


9.3 Component-wise Maxima 

Let $\{Y_1, \ldots, Y_k\}$ be an independent sample from a $d$-variate extreme value
distribution function $G$. In this section, we explain how to estimate $G$. We also
consider the more general situation where the dependence structure of $G$ is that of
a multivariate extreme value distribution, whereas the margins are arbitrary, that is,

$$G(y) = \exp[-l\{-\log G_1(y_1), \ldots, -\log G_d(y_d)\}]$$

for some stable tail dependence function $l$ and arbitrary continuous distribution
functions $G_j$.

These estimation problems may arise in a number of situations. The most
typical one is where the $Y_i$ can be viewed as component-wise maxima over blocks
of variables of some underlying series, $\{X_1, \ldots, X_n\}$, that is,

$$Y_i = \bigvee_{r = (i - 1)m + 1}^{im} X_r, \qquad i = 1, \ldots, k, \tag{9.19}$$

where $km = n$. The $X_r$ may be observed or not. For instance, suppose $X_{rj}$ denotes
the maximal water height at day $r = 1, \ldots, n$ on location $j = 1, \ldots, d$ of a
certain river. If blocks correspond to years ($m = 365$), then $Y_{ij}$ is the maximal water
height in year $i$ at location $j$. With $y_j$ the height of a dike or dam at location $j$,
the probability that there will not be a flood in year $i$ at any of the $d$ locations
is $G(y) = P[Y_i \le y]$. This methodology is in fact the multivariate generalization
of the annual maximum approach already advocated by Gumbel (1958); see
also section 5.1. Historically, it marks the beginning of multivariate extreme value
statistics based on the probabilistic theory of multivariate extremes (Gumbel and
Goldstein 1964).

In (9.19), the $X_r$ need not be independent or identically distributed. In the
water height example, the $X_r$ will certainly feature within-year seasonality as well
as temporal dependence, with high water levels persisting for a number of days in a
row. Still, in the absence of long-range dependence, the operation of taking maxima
over observations aggregated over a whole year may be reasonably expected to
produce an approximately independent sample from a multivariate extreme value
distribution, see section 8.4.
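
The block-maximum operation in (9.19) is applied coordinate by coordinate, which the following sketch makes explicit. The function name and the toy data are hypothetical; each observation is a tuple of $d$ coordinates and the series length is assumed to be a multiple of the block length $m$.

```python
def componentwise_block_maxima(x, m):
    """Split the series x (a list of d-tuples) into blocks of length m and
    return, for each block, the tuple of coordinate-wise maxima."""
    k = len(x) // m
    d = len(x[0])
    return [tuple(max(row[j] for row in x[i * m:(i + 1) * m]) for j in range(d))
            for i in range(k)]


# Toy series with n = 4 observations in d = 2 dimensions and blocks of length m = 2.
series = [(1, 5), (3, 2), (2, 4), (0, 1)]
maxima = componentwise_block_maxima(series, 2)
print(maxima)
```

In the first block the two coordinate maxima occur at different times, so the resulting vector (3, 5) is not itself one of the observations, whereas in the second block the maximum (2, 4) happens to coincide with an observed vector.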





The problem of estimating a multivariate extreme value dependence structure 
may be relevant outside extreme value statistics as well. Specifically, extreme value 
copulas form a large, non-parametric but still manageable subclass of the class of 
copulas with positive association. In this sense, they may be useful in modelling the 
dependence structure of random vectors for which positive association is a reasonable 
assumption. 

Of course, retaining only the component-wise maxima over large blocks of
observations is, as in the univariate case, rather wasteful of data. In the multivariate
set-up, the additional problem arises that the vector of component-wise maxima is 
typically not an observation itself as the maximal observations in each of the vari¬ 
ables need not occur at the same moment. In section 9.4, we will therefore consider 
the more realistic problem of estimating a multivariate extreme value distribution 
based on a random sample from a distribution in its domain of attraction, thereby 
extending the familiar threshold approaches in the univariate case of Chapters 4 and 
5 to the multivariate case. 

Broadly speaking, there are two approaches: non-parametric (section 9.3.1) and 
parametric (section 9.3.2). In the non-parametric approach, we focus on the bivariate 
case, the estimation problem usually being formulated as how to estimate the Pickands 
dependence function A, introduced in section 8.2.5. In the parametric approach, 
the unknown copula is assumed to belong to a certain parametric family, usually 
one of the families described in section 9.2. In section 9.3.3, we will illustrate both 
approaches with the Loss-ALAE data of Figure 9.1. 

In both approaches, the complication arises that, in practice, the margins of G are 
unknown. One option is to model the margins by univariate extreme value distribu¬ 
tions. If, on the other hand, we do not want to make any assumptions on the margins, 
then the alternative consists of estimating the margins by the empirical distributions. 
In any case, proper credit should be given to the statistical uncertainty arising from 
having to estimate the margins, although it is not clear how to do this in a semi- or 
non-parametric context. 


9.3.1 Non-parametric estimation 

Let the random pair $(Y_1, Y_2)$ have distribution function $G$ with continuous margins
$G_1$ and $G_2$ and with extreme value dependence structure, that is,

$$G(y_1, y_2) = \exp\Bigl[ \log\{G_1(y_1) G_2(y_2)\}\, A\Bigl( \frac{\log\{G_2(y_2)\}}{\log\{G_1(y_1) G_2(y_2)\}} \Bigr) \Bigr], \tag{9.20}$$

where $A$ is a Pickands dependence function, see section 8.2.5. The joint survival
function of the pair $\xi = -\log G_1(Y_1)$ and $\eta = -\log G_2(Y_2)$ is

$$P[\xi > x, \eta > y] = \exp\Bigl\{ -(x + y) A\Bigl( \frac{y}{x + y} \Bigr) \Bigr\}, \qquad x \ge 0,\ y \ge 0. \tag{9.21}$$


Observe that $\xi$ and $\eta$ have a standard exponential distribution.

How to estimate $A$ from a random sample $\{(\xi_i, \eta_i) : i = 1, \ldots, k\}$ from (9.21)?
We will consider two families of estimators. The first one consists of refinements of
and improvements over an estimator due to Pickands (1981), while the second one
originates in a more recent proposal by Capéraà et al. (1997). A third approach,
not discussed further in this section, consists of an estimator by Tiago de Oliveira
(1989a), elaborated upon in Deheuvels and Tiago de Oliveira (1989) and Tiago
de Oliveira (1992), the convergence rate of which is too slow to be practically
applicable.

In practice, we do not observe the $(\xi_i, \eta_i)$ but merely the $(Y_{i1}, Y_{i2})$. We cannot
perform the required transformation as the margins are unknown. Hence, we have
to replace the $(\xi_i, \eta_i)$ by $\hat\xi_i = -\log \hat G_1(Y_{i1})$ and $\hat\eta_i = -\log \hat G_2(Y_{i2})$; here
$\hat G_j$ is an estimate of $G_j$, for instance, the (modified) empirical distribution function
$\hat G_j(y) = (k + 1)^{-1} \sum_{i=1}^{k} 1(Y_{ij} \le y)$ or a member of a certain parametric
family, typically that of the univariate extreme value distributions.
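
The transformation to approximately standard exponential margins via the modified empirical distribution function $\hat G_j(y) = (k + 1)^{-1} \sum_{i=1}^{k} 1(Y_{ij} \le y)$ can be sketched as follows; the function names are hypothetical and one margin is treated at a time.

```python
from math import log


def modified_ecdf(sample):
    """Return the modified empirical distribution function of a sample,
    with divisor k + 1 so that its values stay strictly below 1."""
    k = len(sample)

    def G_hat(y):
        return sum(1 for obs in sample if obs <= y) / (k + 1.0)

    return G_hat


def pseudo_exponential(sample):
    """Transform one margin to pseudo-observations -log G_hat(Y_i)."""
    G_hat = modified_ecdf(sample)
    return [-log(G_hat(y)) for y in sample]


xi_hat = pseudo_exponential([3.0, 1.0, 2.0])
print(xi_hat)
```

The divisor $k + 1$ keeps $\hat G_j$ below 1 at the sample maximum, so the logarithm never returns zero; note that the largest observation receives the smallest pseudo-observation.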


Pickands estimator

Let the random pair $(\xi, \eta)$ be as in (9.21). For $t \in [0, 1]$,

$$P\Bigl[ \min\Bigl( \frac{\xi}{1 - t}, \frac{\eta}{t} \Bigr) > x \Bigr] = P[\xi > (1 - t)x,\ \eta > tx] = \exp\{-x A(t)\}, \qquad x \ge 0.$$

In words, the random variable $\min\{\xi/(1 - t), \eta/t\}$ has an exponential
distribution with mean $1/A(t)$. Pickands (1981, 1989) proposed to estimate $A(t)$ by
the reciprocal of the sample mean of the $\min\{\xi_i/(1 - t), \eta_i/t\}$:

$$\frac{1}{\hat A_k^P(t)} = \frac{1}{k} \sum_{i=1}^{k} \min\Bigl( \frac{\xi_i}{1 - t}, \frac{\eta_i}{t} \Bigr), \qquad t \in [0, 1]. \tag{9.22}$$

The Pickands estimator is conceptually simple and easy to compute, but has
the drawback of not satisfying the necessary constraints to be itself a Pickands
dependence function. This was the motivation for a number of modifications of
the estimator. Denote the sample means of the $\xi_i$ and the $\eta_i$ by
$\bar\xi_k = k^{-1} \sum_{i=1}^{k} \xi_i$ and $\bar\eta_k = k^{-1} \sum_{i=1}^{k} \eta_i$, respectively. Deheuvels
(1991) proposed the variant

$$\frac{1}{\hat A_k^D(t)} = \frac{1}{k} \sum_{i=1}^{k} \min\Bigl( \frac{\xi_i}{1 - t}, \frac{\eta_i}{t} \Bigr) - (1 - t) \bar\xi_k - t \bar\eta_k + 1, \qquad t \in [0, 1], \tag{9.23}$$

while Hall and Tajvidi (2000b) suggested

$$\frac{1}{\hat A_k^{HT}(t)} = \frac{1}{k} \sum_{i=1}^{k} \min\Bigl( \frac{\xi_i/\bar\xi_k}{1 - t}, \frac{\eta_i/\bar\eta_k}{t} \Bigr), \qquad t \in [0, 1]. \tag{9.24}$$

The estimator of Deheuvels verifies $\hat A_k^D(0) = \hat A_k^D(1) = 1$, and the estimator of
Hall and Tajvidi satisfies $\hat A_k^{HT}(0) = \hat A_k^{HT}(1) = 1$ as well as
$\hat A_k^{HT}(t) \ge \max(t, 1 - t)$.

Still, none of the three estimators satisfies the constraint that a Pickands
dependence function is convex. An obvious remedy is to replace an initial estimator
$\hat A_k$ by its convex minorant, that is, the largest convex function on the interval
$[0, 1]$ that is bounded from above by $\hat A_k$. Only in the case of the Hall-Tajvidi
estimator does the resulting modification satisfy all the constraints of a Pickands
dependence function; for the Pickands and Deheuvels estimators, some further
modifications are required to meet the constraint $\max(t, 1 - t) \le A(t) \le 1$ for
$t \in [0, 1]$.
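
The three estimators differ only in how the sample mean of the minima is corrected, which a short implementation makes visible. The sketch assumes the forms $1/\hat A_k^P(t) = k^{-1} \sum_i \min\{\xi_i/(1-t), \eta_i/t\}$ for (9.22), the same expression plus $-(1-t)\bar\xi_k - t\bar\eta_k + 1$ for (9.23), and the Pickands form applied to $\xi_i/\bar\xi_k$ and $\eta_i/\bar\eta_k$ for (9.24); division by zero at the endpoints is interpreted as $+\infty$. Function names and toy data are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _min_ratio(x, y, t):
    """min{x/(1-t), y/t}, reading a division by zero as +infinity."""
    if t == 0.0:
        return x
    if t == 1.0:
        return y
    return min(x / (1.0 - t), y / t)


def pickands(xi, eta, t):
    return 1.0 / _mean([_min_ratio(x, y, t) for x, y in zip(xi, eta)])


def deheuvels(xi, eta, t):
    inverse = (_mean([_min_ratio(x, y, t) for x, y in zip(xi, eta)])
               - (1.0 - t) * _mean(xi) - t * _mean(eta) + 1.0)
    return 1.0 / inverse


def hall_tajvidi(xi, eta, t):
    xi_bar, eta_bar = _mean(xi), _mean(eta)
    return 1.0 / _mean([_min_ratio(x / xi_bar, y / eta_bar, t)
                        for x, y in zip(xi, eta)])


xi = [0.5, 1.2, 2.0]
eta = [0.7, 0.9, 2.5]
for t in (0.0, 0.5, 1.0):
    print(t, pickands(xi, eta, t), deheuvels(xi, eta, t), hall_tajvidi(xi, eta, t))
```

On this toy sample the Pickands estimate at $t = 0$ equals $1/\bar\xi_k$ rather than 1 and drops below $\max(t, 1 - t)$ at $t = 1/2$, while the Hall-Tajvidi estimate respects both the endpoint and the lower-bound constraints.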

A final method that has been investigated to improve estimation is through 
smoothing. This might be a particularly good idea if the objective is to estimate 
the second derivative of A, which, in case of a differentiable model, is equal to the 
density of the spectral measure H on the interior of the unit interval, see (8.47). 
Smith et al. (1990) investigate various kinds of kernel estimators based on the 
original Pickands estimator, but conclude that this offers little gain over the usual 
finite-difference approximation of the second derivative based on the Pickands 
estimator itself. Alternatively, Hall and Tajvidi (2000b) suggest to approximate 
an arbitrary initial estimator by a polynomial smoothing spline of degree three or 
more, the knots being equally spaced on the unit interval. For the choice of the 
smoothing parameter, they suggest a cross-validation method. They illustrate by 
means of a small simulation study that this form of smoothing may lead to better 
estimation of A, although they do not mention the effect of smoothing on estimat¬ 
ing the second derivative of A. Finally, the ideas of smoothing and taking convex 
minorants can be combined, in either order. 

Deheuvels (1991) showed convergence of the stochastic processes

$$\delta_k(t) = k^{1/2} [\{\hat A_k(t)\}^{-1} - \{A(t)\}^{-1}]$$

in $t \in [0, 1]$ to centred Gaussian processes with covariance structures depending on
$A$. On the basis of this result, Deheuvels and Martynov (1996) proposed to use the
Cramér-von Mises type statistic $T_k = \int_0^1 \{\delta_k(t)\}^2\, dt$ to test the hypothesis of
independence, $A = 1$. To implement the test, they compute and tabulate the critical
values of the limit distribution of the test statistic, $T_k$, under the null hypothesis.
The use of the convergence results and the proposed test is hampered in practice
because they ignore the fact that, prior to the estimation of $A$, the marginal
distributions have to be estimated as well. Although it seems reasonable to conjecture
that this preliminary marginal estimation will not affect the root-$k$ consistency of
the proposed estimators for $A$, it seems equally probable that the asymptotic
distribution of the estimators will be different from when the marginal distributions
are known.

Capéraà-Fougères-Genest estimator

Another type of estimator of $A$ was proposed by Capéraà et al. (1997). The
motivation and description we give here differ greatly from those in the cited paper
and are intended to be slightly simpler.

Let $(\xi, \eta)$ be a bivariate standard exponential pair with joint survival function
given by (9.21). Then

$$\begin{aligned}
P[\max\{t\xi, (1 - t)\eta\} > x]
&= P[\xi > x/t] + P[\eta > x/(1 - t)] - P[\xi > x/t,\ \eta > x/(1 - t)] \\
&= \exp(-x/t) + \exp\{-x/(1 - t)\} - \exp[-x A(t)/\{t(1 - t)\}]
\end{aligned}$$

for $t \in [0, 1]$ and $x \ge 0$, so that

$$E[\log \max\{t\xi, (1 - t)\eta\}] = \log A(t) + \int_0^\infty \log(x) e^{-x}\, dx.$$

This suggests estimating $A(t)$ by the empirical version of the previous equation,
that is,

$$\log \tilde A_k(t) = \frac{1}{k} \sum_{i=1}^{k} \log \max\{t\xi_i, (1 - t)\eta_i\} - \int_0^\infty \log(x) e^{-x}\, dx.$$

However, $\tilde A_k$ does not satisfy the constraints $A(0) = 1 = A(1)$. This is the
motivation for the following modification, leading to the estimator of Capéraà et al.
(1997):

$$\begin{aligned}
\log \hat A_k^C(t)
&= \log \tilde A_k(t) - t \log \tilde A_k(1) - (1 - t) \log \tilde A_k(0) \\
&= \frac{1}{k} \sum_{i=1}^{k} \log \max\{t\xi_i, (1 - t)\eta_i\} - t \frac{1}{k} \sum_{i=1}^{k} \log \xi_i - (1 - t) \frac{1}{k} \sum_{i=1}^{k} \log \eta_i.
\end{aligned} \tag{9.25}$$

If required, further modifications are possible to make the estimator meet the
constraints of convexity and $\max(t, 1 - t) \le A(t) \le 1$. One such modification is
proposed by Jiménez et al. (2001), although it leads to a consistent estimator only
if $A$ is also log-convex.
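
The second line of (9.25) translates directly into code. The sketch assumes the form $\log \hat A_k^C(t) = k^{-1} \sum_i \log \max\{t\xi_i, (1-t)\eta_i\} - t\, k^{-1} \sum_i \log \xi_i - (1-t)\, k^{-1} \sum_i \log \eta_i$, with strictly positive inputs; the function name is hypothetical.

```python
from math import exp, log


def cfg_estimator(xi, eta, t):
    """Endpoint-corrected estimator of A(t) based on log max{t xi, (1-t) eta}.

    Assumes all xi_i and eta_i are strictly positive and 0 <= t <= 1.
    """
    k = len(xi)
    total = sum(log(max(t * x, (1.0 - t) * y)) - t * log(x) - (1.0 - t) * log(y)
                for x, y in zip(xi, eta))
    return exp(total / k)


# Endpoint constraints hold by construction.
left = cfg_estimator([0.5, 1.2, 2.0], [0.7, 0.9, 2.5], 0.0)
right = cfg_estimator([0.5, 1.2, 2.0], [0.7, 0.9, 2.5], 1.0)
# Identical coordinates mimic complete dependence: the estimate is max(t, 1 - t).
dependent = cfg_estimator([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.3)
```

Because the Euler-constant term cancels in the corrected version, the integral $\int_0^\infty \log(x) e^{-x}\, dx$ never needs to be evaluated.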

Capéraà et al. (1997) also conduct an extensive simulation study comparing
$\hat A^P$, $\hat A^D$ and $\hat A^C$ for a wide range of dependence structures. Their results
strongly indicate that in general, $\hat A^C$ performs better than $\hat A^D$, which in turn is
preferable over $\hat A^P$. In a more restricted simulation study, Hall and Tajvidi (2000b)
demonstrate that further improvements can be made by taking the convex hull of
either $\hat A^C$ or $\hat A^{HT}$ and applying constrained spline smoothing.

9.3.2 Parametric estimation 

Let $l(\cdot; \theta)$ be a parametric family of $d$-variate stable tail dependence functions
indexed by the parameter vector $\theta$; see section 9.2 for a list of popular parametric
families. Assume that the $d$-variate distribution function $G$ has an extreme
value dependence structure with stable tail dependence function $l(\cdot; \theta)$ for some
unknown $\theta$:

$$G(y) = \exp[-l\{-\log G_1(y_1), \ldots, -\log G_d(y_d); \theta\}]. \tag{9.26}$$

How can we estimate $\theta$ from an independent sample $\{Y_i : i = 1, \ldots, k\}$ from $G$?
The answer differs according to the nature of our assumptions on the margins of $G$.

If the $Y_i$ arise as component-wise maxima over large blocks of variables, it
is natural to model the margins $G_j$ ($j = 1, \ldots, d$) as generalized extreme value
(GEV) distributions

$$G_j(y_j) = \exp\Bigl\{ -\Bigl( 1 + \gamma_j \frac{y_j - \mu_j}{\sigma_j} \Bigr)_+^{-1/\gamma_j} \Bigr\}. \tag{9.27}$$

Here, $\gamma_j$ is the extreme value index, while $\mu_j$ and $\sigma_j > 0$ are location and scale
parameters, respectively. The combination of (9.26) and (9.27) now leads to a
fully parametric model for $G$. The marginal and dependence parameters can be
estimated simultaneously by maximum likelihood. Moreover, such joint modelling
allows transfer of information from one margin to the other (Barão and Tawn 1999).
Recall from section 5.1 that for the margin parameters, the estimation problem is
regular for $\gamma_j > -1/2$. For the dependence parameter $\theta$, complications may arise
for parameter values corresponding to independence, and these have to be dealt
with on a case-by-case basis, see below.

However, jointly modelling the margins and the dependence structure may not
always be desirable. For instance, goodness-of-fit tests may cast doubts on the
hypothesis of extreme value margins, although we may still believe in (9.26).
Conversely, Dupuis and Tawn (2001) show that mis-specifying the dependence
structure may have large adverse effects on the estimates of the margin parameters.
A more prudent approach, then, consists of the following. Write (9.26) as

$$G(y; \theta) = C\{G_1(y_1), \ldots, G_d(y_d); \theta\}, \tag{9.28}$$

where

$$C(u; \theta) = \exp[-l\{-\log(u_1), \ldots, -\log(u_d); \theta\}]$$

is the extreme value copula corresponding to $l(\cdot; \theta)$. The copula density is

$$c(u; \theta) = \frac{\partial^d}{\partial u_1 \cdots \partial u_d} C(u; \theta).$$





If we knew the margins $G_j$, then (9.28) would specify a parametric model
for the distribution of the vector $(G_1(Y_1), \ldots, G_d(Y_d))$. Now, as we do not know
the margins, we replace them by the (modified) empirical distribution functions:

$$\hat G_j(y) = \frac{1}{k + 1} \sum_{i=1}^{k} 1(Y_{ij} \le y).$$

Acting as if the $\hat G_j$ were the true margins, we can estimate $\theta$ by maximizing the
pseudo-likelihood

$$L(\theta) = \prod_{i=1}^{k} c\{\hat G_1(Y_{i1}), \ldots, \hat G_d(Y_{id}); \theta\}. \tag{9.29}$$

The resulting estimator for 0 is in fact a special case of the one considered in 
Genest et al. (1995). They establish asymptotic normality of the pseudo-maximum 
likelihood estimator for the parameter of a family of copulas in case the margins are 
unknown and are estimated by the empirical distribution functions. In particular, 
they show that the estimator is efficient in case the true parameter corresponds to 
independence. They also give an explicit expression for the variance-covariance 
matrix and propose a consistent estimator. 
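
For the bivariate symmetric logistic model, the pseudo-likelihood approach can be sketched in a few lines. The code below makes several assumptions that are not taken from the text: the copula density is written in the standard closed form for the Gumbel copula with $\theta = 1/\alpha$, the margins are replaced by rescaled ranks (which coincide with the modified empirical distribution function when there are no ties), and the maximization is a plain grid search over $\alpha$ rather than a proper optimizer. All function names are hypothetical.

```python
from math import log


def ranks_to_uniform(sample):
    """Rescaled ranks rank/(k + 1); assumes the sample has no ties."""
    k = len(sample)
    order = sorted(range(k), key=lambda i: sample[i])
    u = [0.0] * k
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (k + 1.0)
    return u


def log_density_logistic(u, v, alpha):
    """Log copula density of the bivariate symmetric logistic model, 0 < alpha <= 1."""
    theta = 1.0 / alpha
    x, y = -log(u), -log(v)
    w = x ** theta + y ** theta
    s = w ** alpha                      # stable tail dependence function l(x, y)
    return (-s + x + y + (theta - 1.0) * (log(x) + log(y))
            + (alpha - 2.0) * log(w) + log(s + theta - 1.0))


def pseudo_mle_alpha(y1, y2, grid=None):
    """Grid-search maximizer of the pseudo-log-likelihood over alpha."""
    u, v = ranks_to_uniform(y1), ranks_to_uniform(y2)
    if grid is None:
        grid = [j / 100.0 for j in range(5, 101)]

    def loglik(a):
        return sum(log_density_logistic(ui, vi, a) for ui, vi in zip(u, v))

    return max(grid, key=loglik)


# Perfectly concordant toy data push the estimate to the lower end of the grid.
alpha_hat = pseudo_mle_alpha(list(range(1, 51)), list(range(1, 51)))
```

At $\alpha = 1$ the log-density vanishes identically, which corresponds to the independence copula; the grid's lower bound of 0.05 is an arbitrary choice that avoids numerical overflow in $x^{1/\alpha}$.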


Specific models 

Logistic model. Despite the large number of ad hoc methods for statistical
inference on the parameter $\alpha \in (0, 1]$ in the symmetric bivariate logistic model (9.6)
(Gumbel and Mustafi 1967; Hougaard 1986; Shi 1995b; Tiago de Oliveira 1980,
1984, 1989b; Yue 2001), maximum likelihood estimation is nevertheless the most
efficient. In case $0 < \alpha < 1$, the estimation problem is regular. Oakes and
Manatunga (1992) compute the information matrix in case of two-parameter Weibull
margins, whereas Shi (1995a) computes the information matrix in case of
generalized extreme value margins and symmetric multivariate logistic dependence
structure (9.11). Robust estimation within the bivariate logistic model is considered
in Dupuis and Morgenthaler (2002).

In the special case $\alpha = 1$, corresponding to independence, the estimation
problem is non-regular for two reasons: the parameter is on the boundary
of the parameter space, and the variance of the score statistic is infinite. Tawn
(1988a) investigates this case more closely and comes to the following conclusions.
Assume first that the margins are known. Then we can transform the observations
to standard exponential margins with joint survival function as in (8.51) for $l$ as in
(9.6); denote the transformed sample by $(\xi_i, \eta_i)$, $i = 1, \ldots, k$. The score statistic
at $\alpha = 1$ is equal to

$$U_k = \sum_{i=1}^{k} u(\xi_i, \eta_i), \tag{9.30}$$

where

$$u(\xi, \eta) = \xi \log \xi + \eta \log \eta - \log(\xi\eta) - (\xi + \eta - 2) \log(\xi + \eta) - (\xi + \eta)^{-1},$$

that is, $u(\xi, \eta)$ is the derivative of the log-likelihood evaluated at $\alpha = 1$. The
asymptotic distribution of the score statistic is

$$(2^{-1} k \log k)^{-1/2} U_k \xrightarrow{D} N(0, 1), \qquad k \to \infty. \tag{9.31}$$

Large negative values of the score statistic lead to rejection of the null
hypothesis $\alpha = 1$ versus the alternative $\alpha < 1$; in particular, the asymptotic p-value
is $p = \Phi\{(2^{-1} k \log k)^{-1/2} U_k\}$, with $\Phi$ the standard normal distribution function.
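
The score test for independence is simple to compute once the sample has standard exponential margins. The following sketch assumes the score function $u(\xi, \eta) = \xi \log \xi + \eta \log \eta - \log(\xi\eta) - (\xi + \eta - 2) \log(\xi + \eta) - (\xi + \eta)^{-1}$ and the normalization $(2^{-1} k \log k)^{-1/2}$ for (9.31); the function names and the toy input are hypothetical.

```python
from math import erf, log, sqrt


def score_u(x, y):
    """Score contribution of one pair at alpha = 1; requires x, y > 0."""
    s = x + y
    return x * log(x) + y * log(y) - log(x * y) - (s - 2.0) * log(s) - 1.0 / s


def score_test(xi, eta):
    """Return the score statistic U_k and the one-sided asymptotic p-value."""
    k = len(xi)
    U = sum(score_u(x, y) for x, y in zip(xi, eta))
    z = U / sqrt(0.5 * k * log(k))
    p_value = 0.5 * (1.0 + erf(z / sqrt(2.0)))   # Phi(z): small when U is very negative
    return U, p_value


U, p = score_test([1.0, 1.0], [1.0, 1.0])
print(U, p)
```

A sample of size two is of course far too small for the asymptotic approximation to be meaningful; the toy input only serves to show that each pair $(1, 1)$ contributes exactly $-1/2$ to the statistic.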

Unfortunately, the convergence in (9.31) is rather slow, so that the asymptotic
p-value may be far from the true one; therefore, Tawn (1988a) suggests computing
small-sample critical values at the desired significance levels by simulation.
Alternatively, if $\lambda_k$ denotes the likelihood ratio, that is, the ratio between the
likelihoods at the maximum likelihood estimate and at $\alpha = 1$, then

$$\lim_{k \to \infty} P[2 \log \lambda_k \le x] = \begin{cases} \Phi(x^{1/2}), & \text{if } x \ge 0, \\ 0, & \text{if } x < 0, \end{cases} \tag{9.32}$$

leading to a likelihood ratio test for $\alpha = 1$ versus $\alpha < 1$.

Typically, the marginal distributions are unknown and have to be estimated.
Suppose that the margins follow an extreme value distribution and that the shape
parameters are such that the maximum likelihood estimators are regular, that is,
the extreme value indices are larger than $-1/2$, see section 5.1. Tawn (1988a)
shows that if $\alpha = 1$, then the maximum likelihood estimator for $\alpha$ is asymptotically
independent of the maximum likelihood estimators for the margin parameters. In
particular, having to estimate the margin parameters does not change the asymptotic
behaviour of the score and likelihood ratio tests for independence.

In the asymmetric bivariate logistic model (9.7), the problems already encountered
for the symmetric bivariate logistic model are aggravated by the fact that
the parameters $\psi_j$ are non-identifiable when $\alpha = 1$. Tawn (1988a) pragmatically
suggests accepting independence in the asymmetric model if it is accepted in the
symmetric case.

In the multivariate case, testing for independence is certainly not simpler 
than in the bivariate case. By way of example, Tawn (1990) mentions the rather 
non-standard asymptotic behaviour of the score statistics at independence for the 
multivariate symmetric model (9.11) and the nested model (9.12). As pairwise 
independence implies independence, a simpler approach consists of applying just 
the relevant bivariate tests. 

Finally, choosing between all the different logistic models is not easy. A proper
understanding of the physical process generating the data should assist in identifying
the appropriate structure, see, for instance, Tawn (1990).


Mixed model. Also for the mixed model (9.16), despite the abundance of ad
hoc methods for statistical inference on $\psi$ (Gumbel and Mustafi 1967; Posner
et al. 1969; Tiago de Oliveira 1980, 1989b; Yue 2000; Yue et al. 1999), maximum
likelihood estimation is the most efficient. The estimation problem is regular if
$0 < \psi < 1$, while at independence, $\psi = 0$, the situation is completely parallel
to the one for the bivariate symmetric logistic model (9.6) at independence, the
only differences being that now the score function at $\psi = 0$ is

$$u(\xi, \eta) = \frac{\xi\eta}{\xi + \eta} + 2 \frac{\xi\eta}{(\xi + \eta)^3} - \frac{\xi^2 + \eta^2}{(\xi + \eta)^2},$$

the asymptotic distribution of the score statistic $U_k = \sum_{i=1}^{k} u(\xi_i, \eta_i)$ is

$$(15^{-1} k \log k)^{-1/2} \, U_k \xrightarrow{d} N(0, 1), \qquad k \to \infty,$$

and the null hypothesis $\psi = 0$ can be rejected in favour of the alternative $\psi > 0$
in case of large positive values of $U_k$, with p-value $1 - \Phi\{(15^{-1} k \log k)^{-1/2} U_k\}$;
see Tawn (1988a), who also proposes a method to discriminate between the mixed
and logistic models.

For the asymmetric mixed model (9.17), Tawn (1988a) reports that the score 
vector at independence converges to a bivariate normal distribution, although he 
does not give any details. 


9.3.3 Data example 


We applied the methods of this section to the 1500 Loss-ALAE data of Figure 9.1.
As the data do not arise from a time series, there seems to be no obvious way to
partition the data into groups. Therefore, we randomly permuted the data and formed
$k = 50$ groups of size $m = 30$, seeking a compromise between the conflicting criteria
of large group sizes and a large number of groups. Figure 9.4(a) shows, on a
log-scale base 10, a scatterplot of the original data with, superimposed, the
component-wise group maxima $(y_{i1}, y_{i2})$ $(i = 1, \ldots, k)$ computed in (9.19). We transformed
the block maxima to standard exponential margins by $\xi_i = -\log \hat G_1(y_{i1})$ and
$\eta_i = -\log \hat G_2(y_{i2})$, where $\hat G_j(y) = (k + 1)^{-1} \sum_{i=1}^{k} 1(y_{ij} \le y)$ for $j = 1, 2$ are the
(modified) empirical distribution functions, see Figure 9.4(b).
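The two preprocessing steps just described (random partition into blocks, then transformation of the block maxima by the modified empirical distribution function) can be sketched as follows. The function names and the fixed `seed` argument are illustrative assumptions; the original analysis does not specify how the permutation was generated.

```python
import math
import random


def block_maxima(data, k, m, seed=0):
    """Randomly permute `data` (a list of (x1, x2) pairs) and return the k
    component-wise maxima over consecutive blocks of size m."""
    if len(data) != k * m:
        raise ValueError("expected exactly k * m observations")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    maxima = []
    for b in range(k):
        block = shuffled[b * m:(b + 1) * m]
        maxima.append((max(p[0] for p in block), max(p[1] for p in block)))
    return maxima


def to_exponential_margins(maxima):
    """Map block maxima (y_i1, y_i2) to (xi_i, eta_i) = -log G_j(y_ij), with
    G_j(y) = (k + 1)^(-1) * #{i : y_ij <= y} the modified empirical cdf."""
    k = len(maxima)

    def g_hat(j, y):
        return sum(1 for p in maxima if p[j] <= y) / (k + 1)

    return [(-math.log(g_hat(0, y1)), -math.log(g_hat(1, y2)))
            for y1, y2 in maxima]
```

Dividing by k + 1 rather than k keeps the estimated distribution function strictly below one at the largest maximum, so the logarithm stays finite and non-zero.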

Next, we estimated Pickands dependence function by the various parametric and
non-parametric estimators of this section. Figure 9.5(a) shows the non-parametric
estimators by Pickands (9.22), Deheuvels (9.23), Hall-Tajvidi (9.24), and
Caperaa-Fougeres-Genest (9.25). The Pickands estimator does not satisfy the requirements
$A(0) = A(1) = 1$; the estimators of Deheuvels and Hall-Tajvidi are modifications
of the Pickands estimator to enforce this constraint. All estimators are clearly
below the upper boundary of the triangle corresponding to independence. There
also seems to be some evidence of asymmetry. Unfortunately, none of the estimators
is convex, as a Pickands dependence function should be. A possible remedy (not
shown) would be to replace the estimators by their convex minorants.

Estimates that do satisfy all requirements of a Pickands dependence function
result from fitting parametric models as in section 9.3.2. We decided to employ
the semi-parametric likelihood of (9.29) with margins estimated by the empirical
distribution functions rather than to fit the fully parametric model consisting of
(9.26) and (9.27) because the fit of the GEV to the margins was unsatisfactory
(not shown). For simplicity, we only considered the logistic (9.6), asymmetric
logistic (9.7), and bilogistic (9.9) models, although of course the other models in
section 9.2 might have been tried as well. Since $A'(1) = \psi_1$ for the asymmetric
logistic model, the non-parametric estimates in Figure 9.5 strongly suggest $\psi_1 = 1$,
and indeed, imposing this constraint did not lead to a significant decrease in
likelihood. The parameter estimates are given in Table 9.1, and the corresponding
Pickands functions are shown in Figure 9.5(b). For comparison, we also show
the Caperaa-Fougeres-Genest estimate, which is close to the asymmetric logistic
one.

Figure 9.4 Loss-ALAE data: (a) Scatterplot of the data with, superimposed, the
component-wise maxima corresponding to a random partition of the data in 50
blocks of size 30. (b) Scatterplot of these 50 component-wise maxima transformed
to standard exponential margins.

Figure 9.5 Loss-ALAE data: Pickands dependence function estimates based on
block maxima of Figure 9.4. (a) Non-parametric estimates by Pickands,
Deheuvels, Hall-Tajvidi, and Caperaa-Fougeres-Genest.
(b) Semi-parametric estimates based on (9.29) for asymmetric logistic, logistic,
and bilogistic models together with non-parametric estimate
by Caperaa-Fougeres-Genest.

The p-value of Tawn's score statistic for independence (9.30) is equal to 0.005,
clearly rejecting independence. The likelihood ratio test of the logistic against the
bilogistic model gives a p-value of 0.12, showing only weak evidence against
symmetry. Alternatively, in the case of the asymmetric logistic model with $\psi_1 = 1$,
symmetry corresponds to the boundary value $\psi_2 = 1$. This time, the likelihood
ratio statistic should be compared with a one-half chi-squared distribution with one
degree of freedom (Tawn 1988a), also resulting in the p-value $P[\chi^2_1 > 1.37]/2 = 0.12$.
Note that in all these similar tests and confidence intervals, the estimation
uncertainty arising from having to estimate the margins is not taken into account.
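The halved chi-squared p-value quoted above is easy to check, because for $x \ge 0$ one has $P[\chi^2_1 > x]/2 = 1 - \Phi(x^{1/2})$, in agreement with the limit law (9.32). A minimal sketch, with the hypothetical helper name `half_chi2_1_pvalue`:

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def half_chi2_1_pvalue(x):
    """Survival function at x of the limit law in (9.32):
    P[chi2_1 > x] / 2 = 1 - Phi(sqrt(x)) for x >= 0, and 1 for x < 0."""
    if x < 0:
        return 1.0
    return 1.0 - std_normal_cdf(math.sqrt(x))


# Likelihood ratio statistic reported in the text for psi_2 = 1.
p_value = half_chi2_1_pvalue(1.37)
```

Rounded to two decimals, `p_value` reproduces the 0.12 reported in the text.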

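An interesting way to visualize the estimated distribution function of
component-wise maxima

$$\hat G(y_1, y_2) = \exp\left[ \log\{\hat G_1(y_1) \hat G_2(y_2)\} \,
\hat A\left( \frac{\log\{\hat G_2(y_2)\}}{\log\{\hat G_1(y_1) \hat G_2(y_2)\}} \right) \right],$$

is by quantile curves,

$$Q(\hat G, p) = \{(y_1, y_2) : \hat G(y_1, y_2) = p\}, \qquad 0 < p < 1. \qquad (9.33)$$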


Table 9.1 Loss-ALAE data: Parameter estimates, standard
errors, and negative log-likelihoods (NLLH) for semi-parametric likelihood
(9.29) with logistic, asymmetric logistic (constrained at
$\psi_1 = 1$) and bilogistic models fitted to the block maxima of
Figure 9.4.

Model                  Parameter          (Standard error)   NLLH
Logistic               $\alpha = 0.73$    (0.08)             92.55
Asymmetric logistic    $\alpha = 0.61$    (0.13)             91.87
                       $\psi_2 = 0.58$    (0.30)
Bilogistic             $\alpha = 0.23$    (0.23)             91.38
                       $\beta = 0.90$     (0.06)

Figure 9.6 Loss-ALAE data: Estimated quantile curves $Q(\hat F, p)$ (9.34) for
$p = 0.98, 0.99, 0.995$ based on block maxima of Figure 9.4(a) (non-parametric
estimates for margins and Caperaa-Fougeres-Genest estimate for Pickands dependence
function).

As $\hat G(y_1, y_2) = p$ if and only if there exists $w \in [0, 1]$ such that
$\hat G_1(y_1) = p^{(1-w)/\hat A(w)}$ and $\hat G_2(y_2) = p^{w/\hat A(w)}$, the above quantile curve
consists of the points

$$Q(\hat G, p) = \left\{ \left( \hat G_1^{\leftarrow}\{p^{(1-w)/\hat A(w)}\}, \;
\hat G_2^{\leftarrow}\{p^{w/\hat A(w)}\} \right) : w \in [0, 1] \right\}.$$

Exploiting the relationship $F^m \approx G$, with $m$ the block size ($m = 30$ for the
Loss-ALAE data), quantile curves of $G$ can be interpreted as quantile curves of $F$
by

$$Q(\hat F, p) := Q(\hat G, p^m). \qquad (9.34)$$

Figure 9.6 shows the quantile curves $Q(\hat F, p)$ for $p = 0.98, 0.99, 0.995$ with the
margins estimated non-parametrically and with Pickands dependence function
estimated by the Caperaa-Fougeres-Genest estimator (9.25).
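The parametrization of the quantile curve by $w$ translates directly into code. The sketch below is an illustration under assumptions: the marginal quantile functions `q1`, `q2` and the dependence function `A` are supplied by the user, the symmetric logistic dependence function serves only as an example, and the grid is restricted to the interior of $[0, 1]$ because the end points correspond to infinite coordinates for unbounded margins.

```python
import math


def logistic_A(t, alpha):
    """Pickands dependence function of the symmetric logistic model."""
    return ((1.0 - t) ** (1.0 / alpha) + t ** (1.0 / alpha)) ** alpha


def quantile_curve(p, A, q1, q2, n_points=100):
    """Points (q1(p^((1-w)/A(w))), q2(p^(w/A(w)))) of Q(G, p) on an interior
    grid of w values; q1 and q2 are the marginal quantile functions."""
    points = []
    for i in range(n_points):
        w = (i + 0.5) / n_points
        a = A(w)
        points.append((q1(p ** ((1.0 - w) / a)), q2(p ** (w / a))))
    return points


def quantile_curve_F(p, m, A, q1, q2, n_points=100):
    """Quantile curve of F via (9.34): Q(F, p) := Q(G, p^m), m = block size."""
    return quantile_curve(p ** m, A, q1, q2, n_points)
```

With unit Frechet margins, for example, one would pass `q1 = q2 = lambda u: -1.0 / math.log(u)`; in the data example of the text the margins and $A$ are instead replaced by their non-parametric estimates.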


9.4 Excesses over a Threshold 





Let $F$ be a $d$-variate distribution function and let $X_1, \ldots, X_n$ be an independent
sample from $F$. Let $x \in \mathbb{R}^d$ be such that $1 - F_j(x_j)$ is of the order $1/n$ for all
$j$. How to estimate $1 - F(x)$? Because we want to control the relative estimation
error, the empirical distribution function is hardly of any use here. For example,
for $x = x_n$ such that $1 - F(x_n) = 1/n$ and with $F_n$ the empirical distribution
function of the sample, the asymptotic distribution of $\{1 - F_n(x_n)\}/\{1 - F(x_n)\}$
is Poisson(1), whereas in fact we want it to converge to 1.

In order to make any progress, we need to make some regularity assumptions 
on F that will allow us to extrapolate from within the sample region to its border 
or even beyond. Of course, we could assume a parametric model for F, but this 
assumption comes at a high price: how reliable do you believe the model to be 
outside the sample region? What we need instead is a more flexible assumption 
that still allows us to make the necessary jump out of the data. 

Therefore, we will assume that $F$ is in the domain of attraction of a $d$-variate
extreme value distribution function $G$, the dependence structure of which
is described by the stable tail dependence function $l$, see (8.14). Observe that this
assumption is much more realistic than the one of section 9.3, where we assumed
the data to come from a multivariate extreme value distribution itself, rather than
from a distribution in its domain of attraction.

Recall that the condition $F \in D(G)$ motivates the approximation (8.81). Therefore,
basically our estimator will take the form

$$\hat F(x) = \exp[-\hat l\{-\log \hat F_1(x_1), \ldots, -\log \hat F_d(x_d)\}]. \qquad (9.35)$$

Here, the $\hat F_j(x_j)$ are estimators for the marginal tails, typically obtained by one of the
methods of Chapters 4-5. So the task considered in this section is the estimation
of the stable tail dependence function $l$, or, equivalently, of the exponent measure
$\mu_*$, of the spectral measure $S$ w.r.t. any two norms on $\mathbb{R}^d$, or, in the bivariate case,
of Pickands dependence function $A$, amongst others.

We have seen that there does not exist a finite-dimensional parametrization of
the class of dependence structures for multivariate extreme value distributions. To
facilitate statistical inference, we may still assume a parametric model, preferably
one that combines parsimony, analytical tractability and flexibility, and, if possible,
is motivated by the data; see section 9.4.2. But first, we consider in section 9.4.1
the more principled approach of not making such an assumption at all and estimating
the extreme value dependence structure in its full generality. In section 9.4.3,
we will apply the techniques to the Loss-ALAE data of Figure 9.1.


9.4.1 Non-parametric estimation

Estimation principle 


In order to estimate the stable tail dependence function, $l$, we treat the limit relation
(8.90) connecting $l$ and $C_F$, the copula of $F$, as an approximate equality for small
enough $s$. Denote the coordinates of $X_i$ by $X_{ij}$ for $j = 1, \ldots, d$. Setting $s = k/n$
with $k = k_n \to \infty$ and $k/n \to 0$ leads to an estimator of the form

$$\hat l(v) = \frac{1}{k} \sum_{i=1}^{n} 1\{\exists j = 1, \ldots, d : \hat F_j(X_{ij}) > 1 - (k/n) v_j\}
= \frac{1}{k} \sum_{i=1}^{n} 1\{\hat X_{*i} \notin [0, (n/k)(1/v_1, \ldots, 1/v_d)]\}. \qquad (9.36)$$

Here, $\hat F_j$ is an estimator of the marginal distribution $F_j$ (see below), whereas

$$\hat X_{*ij} = 1/\{1 - \hat F_j(X_{ij})\}, \qquad j = 1, \ldots, d.$$

As we will see later, the estimator $\hat l$ of (9.36) is not directly suited to be substituted
in (9.35), and will have to be modified. Still, formula (9.36) contains the gist of
all estimators to come.

Since the convergence in (8.90) is locally uniform in $v \in [0, \infty)$, we may
replace $1 - s v_j$ by any function of the form $1 - s v_j + o(s)$ as $s \downarrow 0$. Taking
empirical versions leads to variants of the estimators considered. For instance, the choice
$e^{-s v_j}$ leads to $\hat l$ as in (9.36) but with $\hat X_{*ij} = -1/\log \hat F_j(X_{ij})$, see, for instance,
Caperaa and Fougeres (2000a). Alternatively, Abdous et al. (1999) prefer $(1 - s)^{v_j}$.
Also, rather than taking a single $s$, Abdous et al. (1999) propose to integrate over
$s > 0$ with respect to a suitable kernel, thereby replacing the problem of how to
choose $s$ by how to choose the kernel and, more importantly, the bandwidth.


Estimating the margins 

We still have to specify the $\hat F_j(X_{ij})$ in the definition of $\hat X_{*ij}$. There are two options:
non-parametric or parametric. In the first option, we estimate $F_j$ by the empirical
distribution function or a variant of it. Denote the rank of $X_{ij}$ among $X_{1j}, \ldots, X_{nj}$
by $R_{ij} = \sum_{s=1}^{n} 1(X_{sj} \le X_{ij})$, that is, $R_{ij} = r$ if and only if $X_{ij} = X_{(r),j}$, where
$X_{(1),j} \le \cdots \le X_{(n),j}$ denote the order statistics of $X_{1j}, \ldots, X_{nj}$. Then possible
estimators are

$$\hat F_j(X_{ij}) =
\begin{cases}
n^{-1}(R_{ij} - a), & \text{for } a \in \{0, 1/2, 1\}, \\
(n + 1)^{-1} R_{ij}.
\end{cases} \qquad (9.37)$$

For instance, the choice $a = 1$ leads to the so-called tail empirical dependence
function

$$\hat l(v) = \frac{1}{k} \sum_{i=1}^{n} 1\{\exists j = 1, \ldots, d : R_{ij} > n + 1 - k v_j\} \qquad (9.38)$$

(Drees and Huang 1998; Huang 1992). The motivation to modify the ordinary empirical
distribution function $\hat F_j(X_{ij}) = n^{-1} R_{ij}$ in the way described is to improve the
relative estimation accuracy of $1 - F_j$ evaluated at high order statistics: for instance, if
$F_j$ is continuous, then $E[1 - F_j(X_{(n),j})] = 1/(n + 1)$, whereas $1 - \hat F_j(X_{(n),j}) = 0$
in case $\hat F_j$ is the ordinary empirical distribution function.
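A rank-based estimator such as the tail empirical dependence function (9.38) needs nothing but the marginal ranks. The following sketch assumes the form of (9.38) obtained from $a = 1$ in (9.37), that is, the condition $R_{ij} > n + 1 - k v_j$; the function names are illustrative.

```python
import bisect


def ranks(values):
    """R_i = #{s : x_s <= x_i}; tied observations share the larger rank."""
    ordered = sorted(values)
    return [bisect.bisect_right(ordered, x) for x in values]


def tail_empirical_dependence(sample, v, k):
    """Estimate l(v) from `sample` (a list of d-tuples) using the ranks:
    (1/k) * #{i : R_ij > n + 1 - k * v_j for at least one j}."""
    n, d = len(sample), len(v)
    col_ranks = [ranks([row[j] for row in sample]) for j in range(d)]
    count = sum(
        1 for i in range(n)
        if any(col_ranks[j][i] > n + 1 - k * v[j] for j in range(d))
    )
    return count / k
```

For perfectly dependent data the estimate of $l(1, 1)$ is close to one, for perfectly opposed data close to two, which mirrors the bounds $\max(v_1, v_2) \le l(v) \le v_1 + v_2$ on the true function.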

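The second option of estimating $F_j(X_{ij})$ starts from the Generalized Pareto
(GP) approximation to the tail function $1 - F_j$ (Pickands 1975), see Chapter 5.
Choose the threshold $u_j = X_{(n-k),j}$, the $(k + 1)$th largest observation in the $j$th
variable, and set

$$1 - \hat F_j(x_j) = \frac{k}{n} \left( 1 + \hat\gamma_j \frac{x_j - u_j}{\hat\sigma_j} \right)_+^{-1/\hat\gamma_j}, \qquad (9.39)$$

where $(\hat\gamma_j, \hat\sigma_j)$ are estimators of the parameters of the approximating GP distribution
to the excess distribution over the threshold $u_j$. In that case,

$$\hat l(v) = \frac{1}{k} \sum_{i=1}^{n} 1\left\{ \exists j = 1, \ldots, d :
\left( 1 + \hat\gamma_j \frac{X_{ij} - u_j}{\hat\sigma_j} \right)_+^{1/\hat\gamma_j} > \frac{1}{v_j} \right\}. \qquad (9.40)$$

The GP parameters can, for instance, be estimated by

$$\hat\gamma_j = M_j^{(1)} + 1 - \frac{1}{2} \left\{ 1 - \frac{(M_j^{(1)})^2}{M_j^{(2)}} \right\}^{-1}, \qquad (9.41)$$

$$\hat\sigma_j = u_j \, \{3 (M_j^{(1)})^2 - M_j^{(2)}\}^{1/2}
\left( \frac{1 + 4(\hat\gamma_j)_-}{\{1 + (\hat\gamma_j)_-\}^2 \{1 + 2(\hat\gamma_j)_-\}} \right)^{-1/2}, \qquad (9.42)$$

where

$$M_j^{(r)} = \frac{1}{k} \sum_{i=0}^{k-1} \{\log X_{(n-i),j} - \log X_{(n-k),j}\}^r, \qquad r = 1, 2,$$

$$(\hat\gamma_j)_- = (-\hat\gamma_j) \vee 0$$

(de Haan and Resnick 1993). These estimators can only be used if the data are
positive; in particular, they are not translation invariant. An alternative that does
not share these drawbacks consists of estimating the GP parameters by maximum
likelihood as in Smith (1987), see section 5.3.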

Whereas the marginal distributions do not influence the non-parametric version 
of Fj(Xij), the performance of the parametric version (9.39) strongly depends on 
the quality of the GP approximation to the excess distribution and also on the 
performance of the estimators of the GP parameters. The estimator (9.40) may 
perform poorly even if the convergence in (8.90) is fast. 


Exponent and spectral measure 

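The estimator $\hat l$ of (9.36) with margins specified as in (9.37) or (9.39) leads to
estimators of the exponent measure $\mu_*$, the spectral measure $S$ w.r.t. two arbitrary
norms $\|\cdot\|_i$ $(i = 1, 2)$ on $\mathbb{R}^d$ and, in the bivariate case, Pickands dependence
function $A$. First of all,

$$\hat l(v) = \hat\mu_*([0, \infty] \setminus [0, (1/v_1, \ldots, 1/v_d)]), \qquad (9.43)$$

where

$$\hat\mu_*(\cdot) = \frac{1}{k} \sum_{i=1}^{n} 1\left( \frac{k}{n} \hat X_{*i} \in \cdot \right) \qquad (9.44)$$

is the empirical version of (8.94), treated as an equality at $t = n/k$. Important
special cases are the tail empirical measure

$$\hat\mu_*(\cdot) = \frac{1}{k} \sum_{i=1}^{n} 1\left\{ \left( \frac{k}{n + 1 - R_{ij}} \right)_{j=1}^{d} \in \cdot \right\} \qquad (9.45)$$

derived from (9.38), and its semi-parametric variant

$$\hat\mu_*(\cdot) = \frac{1}{k} \sum_{i=1}^{n} 1\left\{ \left( \left( 1 + \hat\gamma_j
\frac{X_{ij} - u_j}{\hat\sigma_j} \right)_+^{1/\hat\gamma_j} \right)_{j=1}^{d} \in \cdot \right\} \qquad (9.46)$$

derived from (9.40), see de Haan and Resnick (1993).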

Second, by (8.16), we can turn $\hat\mu_*$ into an estimator of the spectral measure
$S$: set

$$\hat S(\cdot) = \frac{1}{k} \sum_{i=1}^{n} 1(\hat R_i > n/k, \; \hat W_i \in \cdot), \qquad (9.47)$$

where, with $T$ as in (8.15),

$$(\hat R_i, \hat W_i) = T(\hat X_{*i}) = (\|\hat X_{*i}\|_1, \; \hat X_{*i}/\|\hat X_{*i}\|_2). \qquad (9.48)$$

A useful choice of the two norms is the sum-norm, $\|x\| = |x_1| + \cdots + |x_d|$, in
which case the transformation $T$ simplifies to

$$\hat R_i = \hat X_{*i1} + \cdots + \hat X_{*id} \quad \text{and} \quad \hat W_{ij} = \hat X_{*ij}/\hat R_i. \qquad (9.49)$$

The estimator of the spectral measure in (9.47) is based on all observations for
which the radial component exceeds $n/k$. Their number is random, although by
(8.97), there must be approximately $k S(\Xi)$ such observations. If we want to have
exactly $k$ observations involved in the estimation, we could choose $s = 1/\hat R_{(n-k)}$,
with $\hat R_{(n-k)}$ the $(k + 1)$th largest of the $\hat R_i$, rather than $s = k/n$ as we did until now, giving

$$\hat S(\cdot) = \frac{\hat R_{(n-k)}}{n} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}, \; \hat W_i \in \cdot\}, \qquad (9.50)$$

or equivalently

$$\hat S(\Xi) = \frac{k}{n} \hat R_{(n-k)}, \qquad (9.51)$$

$$\hat S(\cdot)/\hat S(\Xi) = \frac{1}{k} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}, \; \hat W_i \in \cdot\}. \qquad (9.52)$$

The above estimator of $S(\Xi)$ may be rather volatile in $k$, and a good idea is to
take the average over a range of values (Caperaa and Fougeres 2000a).

However, an even better idea might be to choose the two norms equal to the
sum-norm: in that case, the total mass of the spectral measure (denoted now by $H$)
on the unit simplex $S_d$ is by (8.26) always equal to the number of dimensions, $d$.
Replacing the estimate $\hat H(S_d)$ by its true value $d$ in (9.52) leads to the estimator

$$\hat H(\cdot) = \frac{d}{k} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}, \; \hat W_i \in \cdot\}, \qquad (9.53)$$

with $\hat R_i$ and $\hat W_{ij}$ as in (9.49). If needed, the estimator of $H$ can be turned into an
estimator of the spectral measure w.r.t. two general norms through (8.38).
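As an illustration of the sum-norm construction (9.49) and the estimator (9.53), the sketch below transforms the margins by ranks (the variant $a = 1$ of (9.37), so that $\hat X_{*ij} = n/(n + 1 - R_{ij})$), keeps the observations whose radius exceeds the $(k + 1)$th largest radius, and assigns mass $d/k$ to each retained angle. All function names are hypothetical, and ties among the radii, which can occur with rank-transformed data, may leave fewer than $k$ points.

```python
import bisect


def standardize_by_ranks(sample):
    """X*_ij = 1 / (1 - F_j(X_ij)) with F_j(X_ij) = (R_ij - 1) / n,
    that is, X*_ij = n / (n + 1 - R_ij)."""
    n, d = len(sample), len(sample[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        ordered = sorted(row[j] for row in sample)
        for i, row in enumerate(sample):
            rank = bisect.bisect_right(ordered, row[j])
            out[i][j] = n / (n + 1 - rank)
    return out


def spectral_points(sample, k):
    """Angles W_i (sum-norm) of the observations whose radius R_i exceeds
    R_(n-k), the (k+1)th largest radius."""
    polar = []
    for x in standardize_by_ranks(sample):
        radius = sum(x)
        polar.append((radius, [c / radius for c in x]))
    radii = sorted(r for r, _ in polar)
    threshold = radii[len(radii) - k - 1]
    return [w for r, w in polar if r > threshold]


def H_hat(angles, k, d, region):
    """Estimated spectral mass (d / k) * #{i : W_i in region}, where
    `region` is a predicate on an angle vector."""
    return d * sum(1 for w in angles if region(w)) / k
```

By construction the estimated total mass equals $d$ whenever exactly $k$ points are retained, in line with the normalization of $H$ described in the text.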

Pickands dependence function 

In the bivariate case, we can transform the above estimators into estimators of
Pickands dependence function $A$. Starting from (9.36), we get

$$\hat A(t) = \hat l(1 - t, t) = \frac{1}{k} \sum_{i=1}^{n} 1[\max\{(1 - t)\hat X_{*i1}, \; t \hat X_{*i2}\} > n/k]. \qquad (9.54)$$

In particular, the extremal coefficient $\theta = 2A(1/2)$ in (8.56) may be estimated by
setting $t = 1/2$: replacing $k$ by $2k$ and letting $\hat l$ be the tail empirical dependence
function (9.38) yields

$$\hat\theta = \frac{1}{k} \sum_{i=1}^{n} 1\{\max(R_{i1}, R_{i2}) > n - k\},$$

variants of which are considered in Falk and Reiss (2001, 2003). Alternatively,
since $A(t) = t \, l\{(1 - t)/t, 1\}$, we could use $t \, \hat l\{(1 - t)/t, 1\}$ to estimate $A(t)$ (Joe
et al. 1992), although this estimator has the drawback of vanishing at $t = 0$,
whereas in fact $A(0) = 1$.

The above estimator for $A$ is not convex; convexity can be ensured if
we start from an estimate of the spectral measure rather than from the stable tail
dependence function. If $\hat S$ denotes the estimator (9.47) of the spectral
measure $S$, we obtain from (8.49),

$$\hat A(t) = \frac{1}{k} \sum_{i=1}^{n} 1(\hat R_i > n/k) \, \hat R_i^{-1} \max\{(1 - t)\hat X_{*i1}, \; t \hat X_{*i2}\}, \qquad (9.55)$$

with $\hat R_i$ as in (9.48). Estimator $\hat A$ was proposed by Caperaa and Fougeres (2000a)
in case both norms are equal to the sum-norm. Finally, the estimator $\hat H$ in (9.53)
leads via (8.46) to

$$\hat A(t) = \frac{2}{k} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}\} \max\{(1 - t)\hat W_{i1}, \; t \hat W_{i2}\}, \qquad (9.56)$$

with $\hat R_i$ and $\hat W_{ij}$ as in (9.49).

The estimator of the extremal coefficient $\theta = 2A(1/2)$ corresponding to (9.56) is

$$\hat\theta = \frac{2}{k} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}\} \max(\hat W_{i1}, \hat W_{i2}). \qquad (9.57)$$

Observe that this estimator is always smaller than two, that is, even if the margins
are perfectly independent, the estimator will still point to asymptotic dependence.
The origin of this deficiency can be traced back to the approximation (8.93) whereupon
the estimator is based: as discussed already, the approximation tends to
undervalue the true probability of joint occurrences of extremes, the consequence
of which is an inherent bias towards stronger asymptotic dependence for estimators
that are based on it.

Finally, observe that the estimators $\hat A$ in (9.55) and (9.56) do not satisfy the
constraint $\max(t, 1 - t) \le A(t) \le 1$. A possible solution consists in the modification

$$\tilde A(t) = \max\{t, \; 1 - t, \; \hat A(t) + 1 - (1 - t)\hat A(0) - t \hat A(1)\}. \qquad (9.58)$$

Via the usual transformation formulae, for instance, (8.44) or (8.47), we can turn
$\tilde A$ into estimators of the stable tail dependence function or the spectral measure
that satisfy all the relevant constraints as well. Still, the modification (9.58) is
rather ad hoc, and it is not clear what the consequences are for the performance of
the estimator. Moreover, the procedure cannot be generalized to higher dimensions.
The problem of constructing truly non-parametric estimators of the spectral measure
that satisfy all the necessary constraints remains open.
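Given the sum-norm angles $(\hat W_{i1}, \hat W_{i2})$ of the $k$ observations with the largest radii, the estimators (9.56) and (9.57) and the modification (9.58) reduce to simple averages. The sketch below takes such a list of angle pairs as input; the function names are illustrative and $k$ is taken to be the length of the list.

```python
def pickands_from_angles(angles, t):
    """Estimator (9.56): (2 / k) * sum of max{(1 - t) w1, t w2} over the
    retained angle pairs (w1, w2), with w1 + w2 = 1."""
    k = len(angles)
    return 2.0 / k * sum(max((1.0 - t) * w1, t * w2) for w1, w2 in angles)


def pickands_corrected(angles, t):
    """Modification (9.58), which enforces A(0) = A(1) = 1 and the bounds
    max(t, 1 - t) <= A(t) <= 1."""
    a_t = pickands_from_angles(angles, t)
    a_0 = pickands_from_angles(angles, 0.0)
    a_1 = pickands_from_angles(angles, 1.0)
    return max(t, 1.0 - t, a_t + 1.0 - (1.0 - t) * a_0 - t * a_1)


def extremal_coefficient(angles):
    """Estimator (9.57): (2 / k) * sum of max(w1, w2), equal to twice the
    estimate (9.56) at t = 1/2."""
    return 2.0 / len(angles) * sum(max(w1, w2) for w1, w2 in angles)
```

Since (9.56) is an average of convex functions of $t$, it lies below the chord through its end points, so the third argument of the maximum in (9.58) never exceeds one; this is why the corrected estimate respects the upper bound.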

Estimating F 

Now let us return to the problem of estimating $1 - F(x)$ for large $x_j$ as in (9.35).
Typically, the marginal tail probabilities $1 - F_j(x_j)$ will be of the order $O(1/n)$ or
even smaller, so that the estimator $\hat l$ given in (9.36) is not suited to be substituted
into (9.35), basically for the same reason why the empirical distribution is not
a very good estimator in the first place: the estimator would involve a region
of the sample space with (almost) no data. A possible remedy is to exploit the
homogeneity of $l$: since $l(tv) = t \, l(v)$ for $t > 0$ and $v \ge 0$, we can put

$$\tilde l(v) = \|v\| \, \hat l(v/\|v\|), \qquad v \in [0, \infty) \setminus \{0\}, \qquad (9.59)$$

with $\hat l$ as in (9.36), while $\|\cdot\|$ denotes an arbitrary norm on $\mathbb{R}^d$. Typical choices
for the norm are the Euclidean norm $\|v\| = (v_1^2 + \cdots + v_d^2)^{1/2}$ (de Haan and de
Ronde 1998) and the max-norm $\|v\| = |v_1| \vee \cdots \vee |v_d|$ (Drees 2001).

The estimator $\tilde l(v)$ of (9.59) has the advantage that for any non-zero $v$, the
number of observations used is of the order $k$. It also inherits the homogeneity
property of $l$. However, $\tilde l$ in (9.59) is not connected in a natural way to an exponent
measure $\tilde\mu_*$ or a spectral measure $\tilde S$ like $\hat l$ is connected to $\hat\mu_*$ and $\hat S$ in (9.43)
and (9.47).

An alternative is to start from $\hat S$ in (9.47) and exploit the connection between
$l$ and $S$ in (8.23) to define instead, for $v \ge 0$,

$$\hat l(v) = \frac{1}{k} \sum_{i=1}^{n} 1(\hat R_i > n/k) \, \hat R_i^{-1} \bigvee_{j=1}^{d} (v_j \hat X_{*ij}), \qquad (9.60)$$

with $\hat R_i = \|\hat X_{*i}\|_1$. Similarly, from (8.17), we can define $\hat\mu_*$ from $\hat S$ by

$$\hat\mu_* \circ T^{-1}(dr, d\omega) = r^{-2} \, dr \, \hat S(d\omega),$$

with $T$ the mapping (8.15). Both $\hat l$ from (9.60) and $\hat\mu_*$ above satisfy the required
homogeneity properties; moreover, $\hat l$ is convex.

Alternatively, taking both norms equal to the sum-norm, we can estimate $l$
starting from the estimator $\hat H$ of (9.53) rather than from $\hat S$, leading to

$$\hat l(v) = \frac{d}{k} \sum_{i=1}^{n} 1\{\hat R_i > \hat R_{(n-k)}\} \bigvee_{j=1}^{d} (v_j \hat W_{ij}), \qquad (9.61)$$

with $\hat R_i$ and $\hat W_{ij}$ as in (9.49).

In the bivariate case, we can combine (9.35) with the definition of Pickands
dependence function to find the estimator

$$\hat F(x_1, x_2) = \exp\left[ \log\{\hat F_1(x_1) \hat F_2(x_2)\} \,
\hat A\left( \frac{\log\{\hat F_2(x_2)\}}{\log\{\hat F_1(x_1) \hat F_2(x_2)\}} \right) \right]. \qquad (9.62)$$

This estimator coincides with the one of (9.35) if we set $\hat A(t) = \hat l(1 - t, t)$ for one
of the choices of $\hat l$ above. Observe that this $\hat A$ is the same as the one in (9.55) or
(9.56) for $\hat l$ as in (9.60) or (9.61), respectively.
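Formula (9.62) only combines two marginal probabilities with a dependence function, as the following sketch shows. The function name `joint_cdf_estimate` is hypothetical, and the inputs `f1`, `f2` stand for the estimated marginal probabilities $\hat F_1(x_1)$, $\hat F_2(x_2)$, assumed to lie strictly between 0 and 1.

```python
import math


def joint_cdf_estimate(f1, f2, A):
    """Evaluate (9.62): exp[log(f1 f2) * A(log f2 / log(f1 f2))] for
    marginal probabilities f1, f2 in (0, 1) and a Pickands function A."""
    log1, log2 = math.log(f1), math.log(f2)
    total = log1 + log2
    return math.exp(total * A(log2 / total))


# Two limiting cases: independence, A = 1, and complete dependence,
# A(t) = max(t, 1 - t).
independent = joint_cdf_estimate(0.999, 0.998, lambda t: 1.0)
dependent = joint_cdf_estimate(0.999, 0.998, lambda t: max(t, 1.0 - t))
```

Under independence the result is the product of the margins and under complete dependence their minimum, so the estimated failure probability $1 - \hat F(x_1, x_2)$ always lies between these two extremes.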


Literature overview 

The tail empirical dependence function (9.38) and tail empirical measure (9.45)
were introduced by D. M. Mason in an unpublished 1991 manuscript and by Huang
(1992). Drees and Huang (1998) showed that the tail empirical dependence function
attains the optimal rate of convergence for estimators of the stable tail dependence
function.





The estimator (9.46) of the exponent measure was first considered by de Haan
and Resnick (1993). They proposed to estimate the Generalized Pareto parameters
as in (9.41) and (9.42). The paper is one of the few in the literature on
multivariate extremes that is written for arbitrary dimension.

For bivariate data, the estimator (9.47) of the spectral measure has been considered
in a number of papers. The first hint was given in de Haan (1985) for
both norms equal to the Euclidean norm as in (8.31). The idea was taken up further
by Einmahl et al. (1993) under the simplifying assumption that the marginal
distributions are the same and heavy-tailed. The restriction of identical margins
was removed in Einmahl et al. (1997), who, for the combination of max-norm and
Euclidean norm (8.33), proposed $\hat S$ as in (9.47) with $\hat\mu_*$ being the estimator (9.46)
of de Haan and Resnick (1993). This estimator for $S$ was modified into a fully
non-parametric one in Einmahl et al. (2001) by choosing for $\hat\mu_*$ the tail empirical
measure (9.45). Alternatively, Caperaa and Fougeres (2000a) considered the case
where both norms are equal to the sum-norm; their estimator is computed as in
(9.47) and (9.49) and with margins transformed to the standard Frechet distribution.
Still in the bivariate case, Abdous et al. (1999) replaced $1 - s v_j$ by $(1 - s)^{v_j}$
in (8.90) and considered kernel variants of (9.36).

The asymptotic theory for estimators of the dependence structure of extremes is 
rather involved, a major difficulty being the fact that the margins are unknown and 
are to be estimated as well. Useful tools in the area of local empirical processes 
can be found in Stute (1984) and Einmahl (1997). 

Estimation of $1 - F(x)$ can be paraphrased as estimation of the probability of
the 'failure region' $\mathbb{R}^d \setminus (-\infty, x]$. More general regions are considered in de Haan
and de Ronde (1998) and de Haan and Sinha (1999). In de Haan and Huang (1995),
estimators of $1 - F(x)$ are turned into estimators of quantile curves
$Q(F, p) = \{x \in \mathbb{R}^d : 1 - F(x) = p\}$ for small failure probabilities $p$.


9.4.2 Parametric estimation 

We consider again the setting of $d$-variate observations $x_1, \ldots, x_n$ that can be
assumed to be realizations of independent random vectors with common distribution
$F$, the aim being to estimate $F(x)$ for $x$ such that $F_j(x_j)$ is close to one. We
assume that $F$ is in the domain of attraction of some extreme value distribution
function $G$, of which the stable tail dependence function belongs to some parametric
family, $l(\cdot\,; \theta)$, indexed by a parameter (vector) $\theta$, usually one of the families
described in section 9.2.

The domain-of-attraction condition together with the parametric specification
of the stable tail dependence function leads by the theory in section 8.3 to parametric
models for $F$ in regions of its support where all coordinates are large. The
model parameters can be estimated by maximum likelihood, leading then to the
desired estimates of $F(x)$. Still, different formulations of the domain-of-attraction
condition lead to different models and hence to different estimators. The two most
popular methods are the so-called point-process method (Coles and Tawn 1991;
Joe et al. 1992) and the censored-likelihood method (Ledford and Tawn 1996;
Smith 1994; Smith et al. 1997), which we will discuss in turn in this section. For
completeness, we mention that Tajvidi (1996) developed a procedure based on
multivariate generalized Pareto distributions as in (8.68).


Point-process method 

Coles and Tawn (1991) and Joe et al. (1992) found a way to turn the point process
characterizations (8.73) and (8.98) into an estimation method. The method
was applied to oceanographic data in Coles and Tawn (1994) and Morton and
Bowers (1996). We present a derivation of the point-process likelihood by a quite
different but simpler argument than the above authors, incidentally avoiding the
point-process machinery.

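By (8.93), we find

$$F(x) \approx 1 - l\{1 - F_1(x_1), \ldots, 1 - F_d(x_d); \theta\}, \qquad (9.63)$$

provided all $1 - F_j(x_j)$ are sufficiently small. Univariate theory suggests modelling
the margins by generalized Pareto distributions: for $j = 1, \ldots, d$ and a high threshold
$u_j$, we model $F_j$ on $[u_j, \infty)$ by

$$F_j(x_j) \approx 1 - \lambda_j \left( 1 + \gamma_j \frac{x_j - u_j}{\sigma_j} \right)_+^{-1/\gamma_j}, \qquad x_j \ge u_j, \qquad (9.64)$$

with $\lambda_j = 1 - F_j(u_j)$. In terms of the function $V_*$ of (8.8), we arrive at the model

$$F(x) \approx 1 - V_*(z; \theta), \qquad x \in \mathbb{R}^d \setminus (-\infty, u], \qquad (9.65)$$

where

$$z_j = z_j(x_j) =
\begin{cases}
\lambda_j^{-1} \left( 1 + \gamma_j \dfrac{x_j - u_j}{\sigma_j} \right)_+^{1/\gamma_j} & \text{if } x_j > u_j, \\
1/\{1 - F_j(x_j)\} & \text{if } x_j \le u_j.
\end{cases}$$

Since $\lambda_j$ is close to zero, one can use the asymptotically equivalent marginal
transformations

$$z_j =
\begin{cases}
-1/\log\left\{ 1 - \lambda_j \left( 1 + \gamma_j \dfrac{x_j - u_j}{\sigma_j} \right)_+^{-1/\gamma_j} \right\} & \text{if } x_j > u_j, \\
-1/\log F_j(x_j) & \text{if } x_j \le u_j.
\end{cases}$$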
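The two marginal transformations for an exceedance $x_j > u_j$ can be compared numerically. The sketch below is a minimal illustration that assumes $\gamma_j \ne 0$ and $x_j$ inside the support of the fitted GP distribution; the function names are hypothetical.

```python
import math


def _gp_base(x, u, gamma, sigma):
    """Return 1 + gamma * (x - u) / sigma, which must be positive."""
    base = 1.0 + gamma * (x - u) / sigma
    if base <= 0.0:
        raise ValueError("x lies outside the support of the fitted GP tail")
    return base


def z_exceedance(x, u, lam, gamma, sigma):
    """z_j for x > u in (9.65): (1 / lam) * (1 + gamma (x - u) / sigma)^(1/gamma)."""
    return _gp_base(x, u, gamma, sigma) ** (1.0 / gamma) / lam


def z_exceedance_log(x, u, lam, gamma, sigma):
    """Asymptotically equivalent variant:
    -1 / log{1 - lam * (1 + gamma (x - u) / sigma)^(-1/gamma)}."""
    tail = lam * _gp_base(x, u, gamma, sigma) ** (-1.0 / gamma)
    return -1.0 / math.log1p(-tail)
```

At the threshold itself with $\lambda_j = 0.01$, the first variant gives 100 and the second about 99.5, which illustrates that the two differ by roughly 1/2 in absolute terms and hence negligibly in relative terms when $\lambda_j$ is small.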


We use (9.65) to jointly estimate the marginal and dependence parameters
from a sample $x_1, \ldots, x_n$. First, we simply estimate $F_j$ on the region $(-\infty, u_j]$
by the marginal empirical distribution function and assume it to be known in
the subsequent analysis. Then, we estimate the parameters $(\gamma_j, \sigma_j)$, $j = 1, \ldots, d$,
and $\theta$ by maximum likelihood, the likelihood contribution of an observation $x_i$
depending on whether $x_i \le u$ or not. On the one hand, if $x_i \le u$, then the likelihood
contribution is simply

$$L(x_i) = F(u) \approx 1 - l(\lambda; \theta).$$

On the other hand, if $x_i \not\le u$, then the likelihood contribution is

$$L(x_i) = \frac{\partial^d}{\partial x_1 \cdots \partial x_d} F(x_i)
\propto -\frac{\partial^d}{\partial z_1 \cdots \partial z_d} V_*(z_i; \theta)
\prod_{j : x_{ij} > u_j} \frac{\partial z_j}{\partial x_j}(x_{ij}),$$

where $z_{ij} = z_j(x_{ij})$. Defining $r_i = z_{i1} + \cdots + z_{id}$ and $w_i = r_i^{-1} z_i$, we can use
(8.34) to rewrite the latter likelihood as

$$L(x_i) \propto r_i^{-(d+1)} h(w_i; \theta) \prod_{j : x_{ij} > u_j} \frac{\partial z_j}{\partial x_j}(x_{ij}),$$

where $h(\cdot\,; \theta)$ is the spectral density of the spectral measure $H(\cdot\,; \theta)$ on the interior
of the unit simplex $S_d$. With $N = \{i = 1, \ldots, n : x_i \not\le u\}$, the total likelihood of
the parameters given the sample is then

$$L(\{x_i\}_{i=1}^{n}; \{(\gamma_j, \sigma_j)\}_{j=1}^{d}, \theta)
\propto \{1 - l(\lambda; \theta)\}^{n - |N|} \prod_{i \in N} \left\{ r_i^{-(d+1)} h(w_i; \theta)
\prod_{j : x_{ij} > u_j} \frac{\partial z_j}{\partial x_j}(x_{ij}) \right\}.$$

Since $l(\lambda; \theta) \le \lambda_1 + \cdots + \lambda_d$ is small and since $|N|$ is small in comparison to $n$,
we can approximate the former by the simpler

$$L(\{x_i\}_{i=1}^{n}; \{(\gamma_j, \sigma_j)\}_{j=1}^{d}, \theta)
\propto \exp\{-l(n\lambda; \theta)\} \prod_{i \in N} \left\{ r_i^{-(d+1)} h(w_i; \theta)
\prod_{j : x_{ij} > u_j} \frac{\partial z_j}{\partial x_j}(x_{ij}) \right\}. \qquad (9.66)$$

This is indeed the likelihood obtained in Coles and Tawn (1991) and Joe et al.
(1992).

Optimization of the above likelihood is to be done numerically. A good initial
guess for the optimizers can be found as follows. First estimate each pair of
marginal parameters $(\gamma_j, \sigma_j)$ separately by maximum likelihood from (9.64). For
these estimates, compute $z_i$, and from $z_i$ compute $r_i$ and $w_i$. Now by (8.97), the
probability density function of those $w_i$ for which the corresponding $r_i$ exceeds
some high threshold is approximately $d^{-1} h(\cdot\,; \theta)$. Maximum likelihood estimation
then yields an initial guess for $\theta$.

Unfortunately, the point-process method suffers from a number of defects. First
of all, it uses (9.63) for $x$ such that some $1 - F_j(x_j)$ are small, whereas in fact
the stated approximation is valid only if all $1 - F_j(x_j)$ are small. This improper
use of (9.63) might corrupt the estimates of dependence parameters that are related
to mass of the spectral measure not in the interior but on the lower-dimensional
faces of the unit simplex. Joe et al. (1992) suggest a possible modification of the
likelihood that should remedy the problem but do not pursue the issue further.

A second defect of the above method is that approximation (9.63) itself is 
not without its own worries: in the text following equation (8.93), we explained 
already that the right-hand side of (9.63) need not define a proper distribution and 
that it tends to undervalue the probability of joint extremes in several coordinates 
simultaneously. The result of this undervaluation is that estimates of dependence 
parameters will show a tendency to be biased towards stronger dependence. In 
particular, asymptotic independence will be rejected too often. All these drawbacks 
are avoided by the censored-likelihood method, to be discussed next. 


Censored-likelihood method 


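Let $u$ be a multivariate threshold such that $F_j(u_j) = \exp(-\lambda_j)$ for some small,
positive $\lambda_j$. Equations (8.63) and (8.64) suggest the following parametric model
for $F$ on the region $[u, \infty)$:

$$F(x) \approx \exp\{-l(v; \theta)\}, \qquad x \ge u, \qquad (9.67)$$

$$v_j = \lambda_j \left( 1 + \gamma_j \frac{x_j - u_j}{\sigma_j} \right)_+^{-1/\gamma_j}, \qquad j = 1, \ldots, d.$$

Observe that (9.67) entails the following model for the margin $F_j$ on the region
$[u_j, \infty)$, $j = 1, \ldots, d$:

$$F_j(x_j) \approx \exp\left\{ -\lambda_j \left( 1 + \gamma_j \frac{x_j - u_j}{\sigma_j} \right)_+^{-1/\gamma_j} \right\},
\qquad x_j \ge u_j. \qquad (9.68)$$

For small $\lambda_j$, this is approximately the same as the Generalized Pareto model (9.64)
for the excess distribution over the threshold $u_j$.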

The marginal parameters $(\lambda_j, \gamma_j, \sigma_j)$, $j = 1, \ldots, d$, and dependence parameters $\theta$ can be estimated jointly by maximum likelihood. Observe that model (9.67) is only specified on the region $[u, \infty)$, and hence does not apply directly to observations outside that region. The solution consists of considering the observation in a coordinate $j$ that is smaller than $u_j$ to be censored from below at $u_j$, hence the name 'censored likelihood'.

So, the likelihood of the parameters given a sample $x_1, \ldots, x_n$ is

$$L\bigl(\{(\lambda_j, \gamma_j, \sigma_j)\}_{j=1}^d, \theta\bigr) = \prod_{i=1}^n L(x_i), \tag{9.69}$$

with the form of the likelihood contribution, $L(x)$, of an observation $x$, depending on which of its coordinates exceed the corresponding threshold coordinates. For $J \subset \{1, \ldots, d\}$, let $R_J$ be the region in $\mathbb{R}^d$ of all $x$ such that $x_j > u_j$ for $j \in J$ and $x_j \le u_j$ for the other $j$. Then for $J = \{j_1, \ldots, j_m\}$, the likelihood contribution of an observation $x$ in the region $R_J$ is proportional to

$$L(x) \propto P[X_j \in dx_j,\ j \in J;\ X_j \le u_j,\ j \notin J] \propto \frac{\partial^m F}{\partial x_{j_1} \cdots \partial x_{j_m}}(x \vee u)$$

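with $F$ as in the right-hand side of (9.67). For instance, in the bivariate case, the plane is partitioned into four regions, depending on whether $x_j$ ($j = 1, 2$) exceeds $u_j$ or not. The likelihood contributions are

$$L(x_1, x_2) \propto \begin{cases}
F(u_1, u_2) & \text{if } x_1 \le u_1,\ x_2 \le u_2, \\[4pt]
\dfrac{\partial F}{\partial x_1}(x_1, u_2) & \text{if } x_1 > u_1,\ x_2 \le u_2, \\[8pt]
\dfrac{\partial F}{\partial x_2}(u_1, x_2) & \text{if } x_1 \le u_1,\ x_2 > u_2, \\[8pt]
\dfrac{\partial^2 F}{\partial x_1 \partial x_2}(x_1, x_2) & \text{if } x_1 > u_1,\ x_2 > u_2,
\end{cases} \tag{9.70}$$

with $F$ and its partial derivatives computed according to the right-hand side of (9.67).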
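The four cases of (9.70) can be sketched in code. The following is only an illustration under stated assumptions: the stable tail dependence function is taken to be the symmetric logistic one, $l(v_1, v_2) = (v_1^{1/\alpha} + v_2^{1/\alpha})^{\alpha}$, the shape parameters are non-zero, and all function names are hypothetical. By the chain rule, $\partial F/\partial x_1 = -F\, l_1\, v_1'$ and $\partial^2 F/\partial x_1 \partial x_2 = F\,(l_1 l_2 - l_{12})\, v_1' v_2'$, where $l_1, l_2, l_{12}$ are partial derivatives of $l$ and $v_j' = dv_j/dx_j$.

```python
import math

def gp_type_v(x, u, lam, sigma, gamma):
    """Return v and dv/dx for v = lam * (1 + gamma*(x-u)/sigma)^(-1/gamma), x >= u, gamma != 0."""
    base = 1.0 + gamma * (x - u) / sigma
    v = lam * base ** (-1.0 / gamma)
    dv = -(lam / sigma) * base ** (-1.0 / gamma - 1.0)
    return v, dv

def model_cdf(y, u, lam, sigma, gamma, alpha):
    """F(y) = exp{-l(v)} as in (9.67), with the (assumed) logistic l, for y >= u."""
    v1, _ = gp_type_v(y[0], u[0], lam[0], sigma[0], gamma[0])
    v2, _ = gp_type_v(y[1], u[1], lam[1], sigma[1], gamma[1])
    return math.exp(-((v1 ** (1 / alpha) + v2 ** (1 / alpha)) ** alpha))

def censored_contribution(x, u, lam, sigma, gamma, alpha):
    """Likelihood contribution (9.70) of one bivariate observation x."""
    # Censor from below: evaluate everything at x v u.
    y = (max(x[0], u[0]), max(x[1], u[1]))
    v1, d1 = gp_type_v(y[0], u[0], lam[0], sigma[0], gamma[0])
    v2, d2 = gp_type_v(y[1], u[1], lam[1], sigma[1], gamma[1])
    s = v1 ** (1 / alpha) + v2 ** (1 / alpha)
    F = math.exp(-(s ** alpha))
    # Partial derivatives of l(v1, v2) = s^alpha.
    l1 = s ** (alpha - 1) * v1 ** (1 / alpha - 1)
    l2 = s ** (alpha - 1) * v2 ** (1 / alpha - 1)
    l12 = (alpha - 1) / alpha * s ** (alpha - 2) * (v1 * v2) ** (1 / alpha - 1)
    above1, above2 = x[0] > u[0], x[1] > u[1]
    if above1 and above2:
        return F * (l1 * l2 - l12) * d1 * d2
    if above1:
        return -F * l1 * d1
    if above2:
        return -F * l2 * d2
    return F
```

Summing the logarithms of these contributions over the sample gives the log-likelihood (9.69) up to an additive constant; a numerical optimizer would then be run over the marginal and dependence parameters jointly.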

Joint estimation of the marginal and dependence parameters has several advantages: transfer of information between variables, leading to better inference of the marginal parameters; proper assessment of the estimation uncertainty of the dependence parameters because of having to estimate the marginal parameters; possibility to incorporate connections between marginal parameters over different margins, for instance, a common shape parameter $\gamma_j = \gamma$. A drawback of the method is the computational complexity, growing worse as the dimension increases. A good idea might therefore be to include a preliminary step, estimating marginal and dependence parameters separately, and then using these estimates as starting values for the optimization procedure leading to the joint estimates.

The censored-likelihood method is first mentioned in Smith (1994). Ledford and Tawn (1996) give it its full development, focusing especially on testing for independence in the bivariate symmetric logistic model (9.6), for which the point-process method of the previous paragraph is known to perform badly for the reasons mentioned there. The method is useful as well in the analysis of extremes of univariate Markov chains (Smith et al. 1997), see section 10.4.5.

9.4.3 Data example

We continue the study of the Loss-ALAE data described in section 9.1. Whereas 
in section 9.3.3 we artificially partitioned the sample into blocks of equal size and 
extracted from each block the pair of component-wise maxima, now we will use 
all bivariate observations which are in some sense large. 

To apply the non-parametric techniques of section 9.4.1, we transform the data to standard Fréchet margins by

$$x^*_{ij} = -1/\log u_{ij}, \qquad i = 1, \ldots, n,\ j = 1, 2,$$

with $u_{ij}$ as in (9.1), the alternative consisting of transforming to standard Pareto margins by $x^*_{ij} = 1/(1 - u_{ij})$. The transformation (9.49) of the pair $(x^*_{i1}, x^*_{i2})$ to pseudo-polar coordinates with both norms equal to the sum-norm takes the simple form

$$r_i = x^*_{i1} + x^*_{i2}, \qquad w_{ij} = x^*_{ij}/r_i,$$

for $i = 1, \ldots, n$ and $j = 1, 2$. Let $r_{(1)} \le \cdots \le r_{(n)}$ be the radial coordinates $r_i$ in ascending order.
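As a small illustration of this transformation, the sketch below maps pairs of uniform scores to standard Fréchet margins and then to pseudo-polar coordinates under the sum-norm. The function names are hypothetical, and the inputs are assumed to be values $u_{ij}$ strictly between 0 and 1 (for instance, rescaled ranks).

```python
import math

def to_frechet(u):
    """Map a uniform score 0 < u < 1 to the standard Frechet scale."""
    return -1.0 / math.log(u)

def pseudo_polar(u_pairs):
    """Return a list of (r_i, w_i1, w_i2) for pairs (u_i1, u_i2), using the sum-norm."""
    result = []
    for u1, u2 in u_pairs:
        x1, x2 = to_frechet(u1), to_frechet(u2)
        r = x1 + x2                      # radial coordinate
        result.append((r, x1 / r, x2 / r))  # angular coordinates sum to one
    return result
```

Sorting the first components of the returned tuples gives the ordered radii $r_{(1)} \le \cdots \le r_{(n)}$.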

If we are to construct estimates from the observations corresponding to the $k$ largest $r_i$, then a sensible choice of $k$ might be found by inspecting the plot of $(k/n)\, r_{(n-k)}$ as a function of $k = 1, \ldots, n - 1$, see Figure 9.7(a). Recall that the estimator (9.50) of the spectral measure $H$ may be written as

$$\hat H(\,\cdot\,) = \frac{r_{(n-k)}}{n} \sum_{i=1}^n 1\{r_i > r_{(n-k)},\ w_{i1} \in \cdot\,\}.$$

Hence, $\hat H([0, 1]) = (k/n)\, r_{(n-k)}$ is an estimator of $H([0, 1]) = 2$. Therefore, we propose to choose the largest $k$ for which $(k/n)\, r_{(n-k)}$ is close to two. Obviously, this is no more than a heuristic and should be formalized in some way. Also, it is not known whether it leads to an optimal choice according to some criterion. Anyway, the plot suggests $k_0 = 337$ as a reasonable choice. Replacing $\hat H([0, 1])$ by its true value then leads to the estimator

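$$\tilde H(\,\cdot\,) = \frac{2}{k_0} \sum_{i=1}^n 1\{r_i > r_{(n-k_0)},\ w_{i1} \in \cdot\,\}, \tag{9.71}$$

see (9.53). Figure 9.7(b) shows a plot of $\tilde H([0, w])$ as a function of $w \in [0, 1]$. The Pickands dependence function corresponding to $\tilde H$ is

$$\hat A(t) = \frac{2}{k_0} \sum_{i=1}^n 1\{r_i > r_{(n-k_0)}\} \max\{(1 - t)\, w_{i1},\ t\, w_{i2}\}, \tag{9.72}$$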

see (9.56). As in (9.58), $\hat A$ can be modified into $\tilde A$ to obtain an estimate satisfying all the requirements to be a Pickands dependence function, although in this case the difference between $\hat A$ and $\tilde A$ turned out to be negligible. The function $\tilde A$ is represented by the dotted line in Figure 9.8(a).

Figure 9.8 Loss-ALAE data: (a) Estimates of Pickands dependence function: asymmetric logistic model via censored likelihood, bilogistic model via censored likelihood and point-process likelihood, and non-parametric estimate (dotted line) obtained by modification of (9.72) via (9.58). (b) Quantile curves $Q(F, p)$ of (9.73) for $p = 0.98, 0.99, 0.995, 0.999$ for asymmetric logistic model and GP margins estimated jointly via censored likelihood.
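The threshold diagnostic $(k/n)\, r_{(n-k)}$ and the non-parametric Pickands estimate of (9.72) are simple enough to sketch directly. The code below is a minimal illustration with hypothetical function names; it assumes the radii `r` and first angular components `w1` have already been computed, that there are no ties among the radii, and that $w_{i2} = 1 - w_{i1}$ as under the sum-norm.

```python
def mass_diagnostic(r, k):
    """(k/n) * r_(n-k): an estimate of the total spectral mass, which should be close to 2."""
    rs = sorted(r)
    n = len(rs)
    return k / n * rs[n - k - 1]  # rs[n-k-1] is the order statistic r_(n-k)

def pickands_estimate(r, w1, k0, t):
    """Non-parametric Pickands dependence function estimate at t, using the k0 largest radii."""
    rs = sorted(r)
    n = len(rs)
    threshold = rs[n - k0 - 1]
    total = sum(max((1.0 - t) * a, t * (1.0 - a))
                for ri, a in zip(r, w1) if ri > threshold)
    return 2.0 / k0 * total
```

The raw estimate need not satisfy the boundary conditions of a Pickands dependence function exactly (its value at $t = 0$ is $(2/k_0) \sum w_{i1}$ over the retained points, which is only approximately one), which is the reason for the modification via (9.58).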

Alternatively, we may fit one of the parametric models of section 9.2 via the 
point-process or censored likelihoods of section 9.4.2. For both losses and ALAEs, 
we need to choose a threshold so that the approximations (9.65) and (9.67) are valid. 
Recall that the approximations entail that the marginal distributions of excesses 
above the corresponding thresholds are modelled by a GP or GEV distribution as 
in (9.64) or (9.68) and the dependence structure by that of a multivariate extreme 
value distribution. Sometimes, marginal and dependence considerations point to
different thresholds; the required modifications of the methods are described in Dixon
and Tawn (1995).

For the Loss-ALAE data, we propose to choose the thresholds $(u_1, u_2)$ in such a way that the total number of observations for which there is an exceedance in at least one coordinate is approximately $k_0 = 337$, the $k$-value found in Figure 9.7(a). Simplifying further, we propose $u_j = x_{(n-k_1),j}$ for $k_1 = \lfloor (k_0 + 1)/2 \rfloor = 169$; here $x_{(1),j} \le \cdots \le x_{(n),j}$ denote the observations in the $j$th coordinate in ascending order. The resulting thresholds are $u_1 = 88\,803$ for Loss and $u_2 = 23\,586$ for ALAE. Marginally fitting the GP by maximum likelihood to the threshold excesses led to $(\hat\sigma_1, \hat\gamma_1) = (79\,916, 0.52)$ for Loss and $(\hat\sigma_2, \hat\gamma_2) = (20\,897, 0.47)$ for ALAE. The goodness-of-fit was confirmed by W-plots (not shown) as in section 5.3.2.
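The threshold rule above amounts to picking an upper order statistic in each margin. A minimal sketch, with a hypothetical helper name and assuming no ties in the data:

```python
def order_statistic_threshold(values, k):
    """Return x_(n-k), so that exactly k observations exceed the threshold (no ties assumed)."""
    xs = sorted(values)
    return xs[len(xs) - k - 1]

k0 = 337
k1 = (k0 + 1) // 2  # floor((k0 + 1) / 2), about half of k0 per margin
```

Because some observations exceed both thresholds, the number of observations with an exceedance in at least one coordinate is at most $2 k_1$, which is why the match with $k_0$ is only approximate.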

We fitted the asymmetric logistic model (9.7) with the censored likelihood (9.69)-(9.70) and the bilogistic model (9.9) with the censored likelihood and the point-process likelihood (9.66). As for the component-wise maxima in section 9.3.3, imposing the constraint $\psi_1 = 1$ did not significantly decrease the likelihood. The parameter estimates are summarized in Table 9.2 and the Pickands dependence functions are shown in Figure 9.8(a).

Comparing the estimated Pickands dependence functions in Figure 9.8(a) with those in Figure 9.5(b) confirms our earlier findings about the inaccuracy of the approximation $1 - F(x_1, x_2) \approx l\{1 - F_1(x_1), 1 - F_2(x_2)\}$. Recall that this approximation underlies both the non-parametric methods and the point-process likelihood. It appears now that the non-parametric estimate (9.72), as well as the one from the bilogistic model fitted with the point-process likelihood, is biased towards stronger dependence. This is a consequence of the undervaluation of the probability of joint extremes by the mentioned approximation, see the explanation after (8.93). On the other hand, the censored likelihood is based on the more accurate approximation $F(x_1, x_2) \approx \exp[-l\{-\log F_1(x_1), -\log F_2(x_2)\}]$, and the resulting estimates, both with the asymmetric logistic and the bilogistic models, are much closer to the ones obtained from component-wise maxima. The standard errors of the censored-likelihood estimates are much smaller than their component-wise maxima counterparts, reflecting the more efficient use of information of the threshold approach.

Table 9.2 Loss-ALAE data: Estimates (standard errors; ** if observed information matrix was near-singular) for marginal and dependence parameters for asymmetric logistic model ($\psi_1 = 1$) with censored likelihood and bilogistic model with censored and point-process likelihoods.

Model                             Loss: σ1/1000   Loss: γ1      ALAE: σ2/1000   ALAE: γ2      Dependence
Asymmetric logistic (censored)    82 (1.3)        0.58 (0.10)   23 (0.4)        0.51 (0.09)   α = 0.66 (0.04), ψ2 = 0.89 (0.15)
Bilogistic (censored)             84 (2.1)        0.59 (0.10)   25 (2.5)        0.47 (0.10)   α = 0.55 (0.09), β = 0.76 (0.05)
Bilogistic (point-process)        84              0.79          25              0.64          α = 0.54 (**), β = 0.57 (**)

We conclude with a picture of the quantile curves

$$Q(F, p) = \{(x_1, x_2) : F(x_1, x_2) = p\}, \qquad 0 < p < 1, \tag{9.73}$$

with $F$ as in the model (9.67) underlying the censored likelihood and with asymmetric logistic dependence structure. The quantile curves are shown in Figure 9.8(b) for $p = 0.98, 0.99, 0.995, 0.999$. Since the model (9.67) can be written as

$$\hat F(x_1, x_2) = \exp\left[\log\{\hat F_1(x_1) \hat F_2(x_2)\}\, \hat A\left(\frac{\log\{\hat F_2(x_2)\}}{\log\{\hat F_1(x_1) \hat F_2(x_2)\}}\right)\right],$$

with marginal estimates $\hat F_j(x_j)$ as in (9.68) and with $\hat A(w) = A(w; \hat\theta)$, the Pickands dependence function corresponding to the estimated dependence parameter vector $\hat\theta$, we have $\hat F(x_1, x_2) = p$ if and only if there exists $w \in [0, 1]$ such that $\hat F_1(x_1) = p^{(1-w)/\hat A(w)}$ and $\hat F_2(x_2) = p^{w/\hat A(w)}$. Therefore, the quantile curve can be computed from

$$Q(\hat F, p) = \left\{\left(\hat F_1^{\leftarrow}\bigl(p^{(1-w)/\hat A(w)}\bigr),\ \hat F_2^{\leftarrow}\bigl(p^{w/\hat A(w)}\bigr)\right) : w \in [0, 1]\right\}.$$

For fixed $w \in [0, 1]$, point-wise confidence intervals could be added (not shown) to the quantile curves from the observed information matrix and the delta method.
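The parametrization of the quantile curve by $w$ lends itself to a short computation. The sketch below is an illustration only: it uses the symmetric logistic Pickands function $A(w) = \{(1-w)^{1/\alpha} + w^{1/\alpha}\}^{\alpha}$ as a stand-in for the asymmetric logistic model, inverts the margin (9.68) in closed form for non-zero shape, and restricts $w$ to the open interval $(0, 1)$ because the end-points correspond to infinite coordinates. All function names and parameter values are hypothetical.

```python
import math

def logistic_A(w, alpha):
    """Symmetric logistic Pickands dependence function (assumed stand-in model)."""
    return ((1.0 - w) ** (1.0 / alpha) + w ** (1.0 / alpha)) ** alpha

def margin_cdf(x, u, lam, sigma, gamma):
    """Marginal model (9.68) for x >= u and gamma != 0."""
    return math.exp(-lam * (1.0 + gamma * (x - u) / sigma) ** (-1.0 / gamma))

def margin_quantile(q, u, lam, sigma, gamma):
    """Inverse of margin_cdf, valid for exp(-lam) <= q < 1."""
    if not math.exp(-lam) <= q < 1.0:
        raise ValueError("q lies outside the range covered by the threshold model")
    return u + sigma / gamma * ((-math.log(q) / lam) ** (-gamma) - 1.0)

def quantile_curve(p, margin1, margin2, alpha, m=50):
    """Points (x1, x2) with F(x1, x2) = p, for w = 1/m, ..., (m-1)/m.

    margin1 and margin2 are tuples (u, lam, sigma, gamma).
    """
    points = []
    for i in range(1, m):
        w = i / m
        a = logistic_A(w, alpha)
        x1 = margin_quantile(p ** ((1.0 - w) / a), *margin1)
        x2 = margin_quantile(p ** (w / a), *margin2)
        points.append((x1, x2))
    return points
```

Since $A(w) \ge \max(w, 1 - w)$, both marginal probabilities are at least $p$, so the curve stays inside the region where the threshold model applies whenever $p \ge \exp(-\lambda_j)$ for both margins.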


9.5 Asymptotic Independence 

Everything so far in this chapter was based on multivariate extreme value distributions. The justification is to be found in the theory of Chapter 8. Still, within the class of max-stable distributions, the only possible type of asymptotic independence is, in fact, perfect independence. This makes the class rather inappropriate for modelling data that exhibit positive or negative association that only gradually disappears at more and more extreme levels. To properly handle such cases, we are obliged to leave the by-now familiar framework of extreme value distributions and look for a class of models describing the tails of asymptotically independent distributions in a more refined way.

In section 9.5.1, we introduce a number of coefficients of extremal dependence useful in assessing whether a bivariate distribution is asymptotically dependent or asymptotically independent and, within each case, in giving a relative measure of strength of dependence (Coles et al. 1999). In particular, we find that the so-called coefficient of tail dependence (Ledford and Tawn 1996) is most useful in distinguishing asymptotic dependence from asymptotic independence and, within the class of asymptotically independent distributions, positive from negative association. Several methods to estimate this coefficient are described in section 9.5.2. Finally, section 9.5.3 describes a general model, due to Ledford and Tawn (1997), for the joint survivor function of a bivariate distribution, encompassing asymptotic dependence as well as various types of asymptotic independence. We also discuss a number of inference techniques for this joint tail model, some of which are new.

9.5.1 Coefficients of extremal dependence 

Asymptotic dependence 

Let $(X_1, X_2)$ be a bivariate random vector with distribution function $F$ and marginal distribution functions $F_1$ and $F_2$. For simplicity, we will assume throughout that $F_1$ and $F_2$ are continuous. Assuming first that $F_1$ and $F_2$ are identical, a quite natural coefficient of dependence between $X_1$ and $X_2$ at extreme levels is

$$\chi = \lim_{x \uparrow x^*} P[X_2 > x \mid X_1 > x], \tag{9.74}$$

provided the limit exists; here, $x^*$ denotes the right end-point of the common marginal distribution (Coles et al. 1999). Definition (9.74) can be generalized to the case where the marginal distribution functions $F_1$ and $F_2$ are non-identical. The variables $U_j = F_j(X_j)$ ($j = 1, 2$) are uniformly distributed on $(0, 1)$. Now define

$$\chi = \lim_{u \uparrow 1} P[U_2 > u \mid U_1 > u], \tag{9.75}$$

again provided that the limit exists. Observe that (9.74) is indeed a special case of (9.75).

The number $\chi$ can be interpreted as the tendency for one variable to be extreme given that the other is extreme. When $\chi = 0$, the variables are said to be asymptotically independent, whereas if $0 < \chi \le 1$, they are said to be asymptotically dependent. Observe that the condition for asymptotic independence, that is, $\chi = 0$, coincides with the necessary and sufficient condition (8.100) for $F$ to be asymptotically independent in the sense described there. Hence, if $F_1$ and $F_2$ are in the domain of attraction of univariate extreme value distributions $G_1$ and $G_2$ respectively, then $\chi = 0$ if and only if $F$ is in the domain of attraction of the bivariate extreme value distribution $G(x, y) = G_1(x) G_2(y)$.

Recall from section 8.2.6 that the copula function of $F$, denoted by $C = C_F$, is equal to the distribution function of the pair $(U_1, U_2)$, that is,

$$C(u_1, u_2) = P[U_1 \le u_1, U_2 \le u_2] = F\{F_1^{\leftarrow}(u_1), F_2^{\leftarrow}(u_2)\} \tag{9.76}$$

for $(u_1, u_2) \in [0, 1]^2$, where, as usual, the arrow denotes the left-continuous inverse of a function. As the copula contains all information about the joint distribution of $X_1$ and $X_2$ except for the marginal information, it can be interpreted as the dependence structure associated with $X_1$ and $X_2$.

Now, defining

$$\chi(u) = 2 - \frac{\log C(u, u)}{\log u}, \qquad 0 < u < 1,$$

we have

$$\chi(u) = 2 - \frac{1 - C(u, u)}{1 - u} + o(1) = P[U_2 > u \mid U_1 > u] + o(1), \qquad u \to 1, \tag{9.77}$$

whence

$$\lim_{u \to 1} \chi(u) = \chi. \tag{9.78}$$

In general, the function $\chi(u)$ is bounded from below and above by

$$2 - \frac{\log\{\max(2u - 1, 0)\}}{\log u} \le \chi(u) \le 1, \qquad 0 < u < 1. \tag{9.79}$$

These bounds follow from the respective bounds

$$\max(2u - 1, 0) \le C(u, u) \le u, \qquad 0 < u < 1, \tag{9.80}$$

the left-hand side corresponding to perfect negative dependence and the right-hand side to perfect positive dependence.

Next to providing the limit $\chi$, the function $\chi(u)$ also provides some insight in the dependence structure of the variables at lower quantile levels. In particular, $\chi(u)$ is less than, equal to or greater than 0 if and only if $C(u, u)$ is less than, equal to or greater than $u^2$ respectively. Since $C(u, u) = u^2$ corresponds to the case of exact independence, we find that the sign of $\chi(u)$ determines whether the variables are positively or negatively associated at quantile level $u$.

In the special case that $C$ is a bivariate extreme value copula with Pickands dependence function $A$ as in (8.54), we have $C(u, u) = u^{\theta}$ with $\theta = 2A(1/2) \in [1, 2]$ the extremal coefficient of (8.56). In particular, $\chi(u) = 2 - \theta \in [0, 1]$, constant in $0 < u < 1$. As a consequence, estimates of $\chi(u)$ can be used not only to gain information on the limiting behaviour as $u \to 1$ or the dependence structure at lower quantile levels but also as a diagnostic for membership to the bivariate extreme value class. More generally, if $C$ is in the domain of attraction of a bivariate extreme value copula in the sense of (8.80), then by (8.92) we also have $\chi = 2 - \theta$.
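The constancy of $\chi(u)$ for an extreme value copula is easy to check numerically. The sketch below does this for the logistic extreme value copula with parameter $\alpha$, for which $\theta = 2A(1/2) = 2^{\alpha}$; the function names are hypothetical.

```python
import math

def logistic_copula_diagonal(u, alpha):
    """C(u, u) for the logistic extreme value copula with parameter 0 < alpha <= 1."""
    return math.exp(-((2.0 * (-math.log(u)) ** (1.0 / alpha)) ** alpha))

def chi_of_u(c_uu, u):
    """chi(u) = 2 - log C(u, u) / log u."""
    return 2.0 - math.log(c_uu) / math.log(u)

alpha = 0.6
# Each value should equal 2 - 2**alpha up to rounding, whatever the level u.
values = [chi_of_u(logistic_copula_diagonal(u, alpha), u) for u in (0.1, 0.5, 0.9, 0.99)]
```

A plot of an empirical version of $\chi(u)$ that is far from flat would therefore speak against the bivariate extreme value class.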


Asymptotic independence 

Within the class of asymptotically dependent variables ($0 < \chi \le 1$) the value of $\chi$ increases with increasing degree of dependence at extreme levels. The measure fails, however, to discriminate between the degrees of relative strength of dependence for asymptotically independent variables ($\chi = 0$). For that purpose, a quite natural alternative measure of dependence $\bar\chi$ has been defined, analogous to $\chi$, but based on a comparison of joint and marginal survivor functions of $U_1$ and $U_2$ (Coles et al. 1999).

With the copula survivor function defined as

$$\bar C(u_1, u_2) = P[U_1 > u_1, U_2 > u_2] = 1 - u_1 - u_2 + C(u_1, u_2)$$

for $(u_1, u_2) \in [0, 1]^2$, let

$$\bar\chi(u) = \frac{2 \log(1 - u)}{\log \bar C(u, u)} - 1, \qquad 0 < u < 1, \tag{9.81}$$

the precise definition being chosen for scaling convenience. From (9.80), we get

$$\frac{2 \log(1 - u)}{\log\{\max(1 - 2u, 0)\}} - 1 \le \bar\chi(u) \le 1, \qquad 0 < u < 1. \tag{9.82}$$

Then, as a second limiting dependence measure, we define

$$\bar\chi = \lim_{u \to 1} \bar\chi(u), \tag{9.83}$$

provided the limit exists. By (9.82), we have $-1 \le \bar\chi \le 1$.

For asymptotically dependent variables, we have $\bar\chi = 1$; for asymptotically independent variables, we have $-1 \le \bar\chi < 1$, and $\bar\chi$ provides a limiting measure that increases with relative dependence strength within this class. As a result, the pair $(\chi, \bar\chi)$ can be used as a one-dimensional summary of extremal dependence: if $\bar\chi = 1$ and $0 < \chi \le 1$, the variables are asymptotically dependent and $\chi$ is a measure for strength of dependence within the class of asymptotically dependent distributions; if $-1 \le \bar\chi < 1$ and $\chi = 0$, the variables are asymptotically independent, and $\bar\chi$ is a measure for strength of dependence within the class of asymptotically independent distributions.


The coefficient of tail dependence 


Rather than transforming the original random variables $X_1$ and $X_2$ to uniform margins, it is also convenient to transform them to standard Fréchet margins by $Z_j = -1/\log U_j$ for $j = 1, 2$. Clearly, this leaves the copula invariant and hence does not affect the dependence measures discussed. The joint survival function of $(Z_1, Z_2)$ can be found in terms of $\bar C$ through

$$P[Z_1 > z_1, Z_2 > z_2] = \bar C(e^{-1/z_1}, e^{-1/z_2}) \tag{9.84}$$

for $0 < z_j < \infty$ ($j = 1, 2$). Since $P[Z_j \le z] = \exp(-1/z)$ for $z > 0$ and $j = 1, 2$, we have $P[Z_j > z] \sim 1/z$ as $z \to \infty$.





Next to $\chi$ and $\bar\chi$, Ledford and Tawn (1996) introduce a third dependence coefficient by assuming that the joint survivor function of $Z_1$ and $Z_2$ is a regularly varying function:

$$P[Z_1 > z, Z_2 > z] = \mathcal{L}(z)\, z^{-1/\eta}, \qquad z > 0. \tag{9.85}$$

Here, $\eta$ is a positive constant, called the coefficient of tail dependence, and $\mathcal{L}$ is a slowly varying function, that is, $\mathcal{L}(xz)/\mathcal{L}(z) \to 1$ as $z \to \infty$ for all $0 < x < \infty$. The rate of decay in (9.85) is primarily controlled by $\eta$. Since $P[Z_1 > z, Z_2 > z] \le 1 - \exp(-1/z) \sim 1/z$, we must have $\eta \le 1$. Exploiting the fact that $P[Z_1 > z, Z_2 > z] = P[\min(Z_1, Z_2) > z]$, we can identify $\eta$ as the tail index of the univariate variable $T = \min(Z_1, Z_2)$. Ledford and Tawn (1996) motivate their model through examples. The wide applicability of (9.85) is demonstrated by the extensive list of examples in Heffernan (2000). Still, the (somewhat pathological) counterexamples in Schlather (2001) show that (9.85) neither implies nor is implied by the familiar domain-of-attraction condition.

In (9.85), if $\mathcal{L}(z)\, z^{1 - 1/\eta} \sim P[Z_2 > z \mid Z_1 > z]$ converges as $z \to \infty$, the limit is equal to $\chi$. Moreover, from (9.84), it follows that

$$\bar C(u, u) = \mathcal{L}(-1/\log u)\, (-\log u)^{1/\eta}, \qquad 0 < u < 1,$$

and thus, by (9.81),

$$\bar\chi = \lim_{u \to 1} \bar\chi(u) = 2\eta - 1.$$

As a consequence, if $\eta = 1$ and $\lim_{z \to \infty} \mathcal{L}(z) = c$ for some $0 < c \le 1$, then $\bar\chi = 1$ and the variables are asymptotically dependent of degree $\chi = c$. On the other hand, if $0 < \eta < 1$ or if $\eta = 1$ and $\lim_{z \to \infty} \mathcal{L}(z) = 0$, then $\chi = 0$ and the variables are asymptotically independent of degree $\bar\chi = 2\eta - 1$.

Within the class of asymptotically independent variables, three types of independence can be identified according to the sign of $\bar\chi = 2\eta - 1$ (Heffernan 2000). First, when $1/2 < \eta < 1$ or $\eta = 1$ and $\mathcal{L}(z) \to 0$ as $z \to \infty$, observations for which both $Z_1$ and $Z_2$ exceed a large threshold $z$ occur more frequently than under exact independence (positive association). Second, when $\eta = 1/2$, extremes of $Z_1$ and $Z_2$ are near independent and even exactly independent in case $\mathcal{L}(z) = 1$. Finally, when $0 < \eta < 1/2$, observations for which both $Z_1$ and $Z_2$ exceed a large threshold $z$ occur less frequently than under exact independence (negative association). All in all, the degree of dependence between large values of $Z_1$ and $Z_2$ is determined by $\eta$, with increasing values of $\eta$ corresponding to stronger association. For a given $\eta$, the relative strength of dependence is characterized by $\mathcal{L}$.

Finally, note that the whole story can be repeated if we transform the variables $X_j$ to standard Pareto margins by $Z_j = 1/\{1 - F_j(X_j)\}$ rather than to standard Fréchet margins. The joint survivor function of $(Z_1, Z_2)$ is then given by

$$P[Z_1 > z_1, Z_2 > z_2] = \bar C(1 - 1/z_1, 1 - 1/z_2). \tag{9.86}$$

The only difference will be that the slowly varying function $\mathcal{L}$ corresponding to Pareto margins will have a different second-order behaviour than the one corresponding to Fréchet margins.

Example 9.1 The bivariate extreme value copula with logistic dependence structure (9.6) is given by

$$C(u_1, u_2) = \exp\left[-\{(-\log u_1)^{1/\alpha} + (-\log u_2)^{1/\alpha}\}^{\alpha}\right]$$

with parameter $0 < \alpha \le 1$. Perfect independence arises as $\alpha = 1$, while $0 < \alpha < 1$ leads to asymptotic dependence. The bivariate survivor function corresponding to standard Fréchet margins (9.84) satisfies

$$P[Z_1 > z, Z_2 > z] = (2 - 2^{\alpha})\, z^{-1} + (2^{2\alpha - 1} - 1)\, z^{-2} + o(z^{-2}) \tag{9.87}$$

as $z \to \infty$, while transforming to standard Pareto margins (9.86) gives

$$P[Z_1 > z, Z_2 > z] = (2 - 2^{\alpha})\, z^{-1} + (2^{2\alpha - 1} - 2^{\alpha - 1})\, z^{-2} + o(z^{-2}) \tag{9.88}$$

as $z \to \infty$. If $0 < \alpha < 1$, then in both cases we find, as expected, a coefficient of tail dependence $\eta = 1$ and a slowly varying function $\mathcal{L}$ converging to $\chi = 2 - 2^{\alpha}$.

Example 9.2 The bivariate Farlie-Gumbel-Morgenstern copula is given by

$$C(u_1, u_2) = u_1 u_2 \{1 + \alpha (1 - u_1)(1 - u_2)\},$$

with parameter $-1 \le \alpha \le 1$. For $\alpha = 0$, $\alpha > 0$, and $\alpha < 0$, we get exact independence, positive dependence, and negative dependence, respectively. Complete dependence cannot be achieved under this model. As

$$\chi(u) = 2 - \frac{\log[u^2 \{1 + \alpha (1 - u)^2\}]}{\log u}, \qquad 0 < u < 1,$$

we get $\chi(u) \to \chi = 0$ as $u \to 1$, that is, all distributions in this family are asymptotically independent. Examining the relative strength of dependence within the class of asymptotically independent variables, we notice that

$$\bar\chi(u) = \frac{2 \log(1 - u)}{\log[1 - 2u + u^2 \{1 + \alpha (1 - u)^2\}]} - 1, \qquad 0 < u < 1,$$

so $\bar\chi$ equals 0 for $\alpha > -1$ (near independence) and $-1/3$ for $\alpha = -1$ (negative association). Transforming to standard Fréchet margins leads to the joint survivor function (9.84)

$$P[Z_1 > z, Z_2 > z] = \frac{\alpha + 1}{z^2} - \frac{3\alpha + 1}{z^3} + \frac{55\alpha + 7}{12 z^4} + O(z^{-5}), \qquad z \to \infty.$$

This expansion allows us to identify $\eta$ and $\mathcal{L}$ in (9.85) as a function of $\alpha$. In case $\alpha > -1$, we have $\eta = 1/2$ and $\mathcal{L}(z) = (\alpha + 1) - (3\alpha + 1) z^{-1} + o(z^{-1})$ as $z \to \infty$ (near independence); in case $\alpha = -1$, we have $\eta = 1/3$ and $\mathcal{L}(z) = 2 - 4 z^{-1} + o(z^{-1})$ as $z \to \infty$ (negative association). Notice that $\chi = \lim_{z \to \infty} \mathcal{L}(z)\, z^{1 - 1/\eta}$ and $\bar\chi = 2\eta - 1$, as expected.

Example 9.3 The bivariate normal distribution with correlation $\rho < 1$ is a prime example of asymptotic independence (Sibuya 1960). The joint survivor function (9.84) for margins transformed to the standard Fréchet distribution satisfies

$$P[Z_1 > z, Z_2 > z] \sim c_{\rho}\, (\log z)^{-\rho/(1+\rho)}\, z^{-2/(1+\rho)}, \qquad z \to \infty,$$

where $c_{\rho} = (1 + \rho)^{3/2} (1 - \rho)^{-1/2} (4\pi)^{-\rho/(1+\rho)}$ (Reiss 1989). In particular, the distribution is asymptotically independent ($\chi = 0$) with $\bar\chi = \rho$ and $\eta = (1 + \rho)/2$. Within the class of asymptotic independence, the cases of positive association, near independence, and negative association arise as $\rho > 0$, $\rho = 0$, and $\rho < 0$, respectively.

The bivariate normal distribution with correlation 0 < p < 1 illustrates that 
dependence at intermediate levels, however strong, does not necessarily imply 
asymptotic dependence. This may lead to problems when we apply techniques 
based on bivariate extreme value dependence structures, where choices are limited 
to asymptotic dependence or exact independence. For instance, Ledford and Tawn 
(1996) show that the score test for independence using the censored likelihood 
(9.69)-(9.70) with logistic model will nearly always reject independence in case 
data are generated from a bivariate normal distribution with positive correlation. 
Still, extrapolating from an asymptotically dependent model fitted to the tail of the 
bivariate normal distribution will lead to overestimation of the probability of the 
occurrence of joint extremes. 
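The limits stated for the Farlie-Gumbel-Morgenstern copula of Example 9.2 can be checked numerically. The sketch below uses the identity $1 - 2u + u^2\{1 + \alpha(1-u)^2\} = (1-u)^2 (1 + \alpha u^2)$ and works with $t = 1 - u$ to avoid cancellation near $u = 1$. Function names are hypothetical; convergence of $\bar\chi(u)$ is logarithmically slow, so only rough agreement with the limit should be expected.

```python
import math

def fgm_chibar(t, a):
    """chi-bar(u) at u = 1 - t for the FGM copula with parameter a."""
    u = 1.0 - t
    surv = t * t * (1.0 + a * u * u)  # joint survivor function of the copula at (u, u)
    return 2.0 * math.log(t) / math.log(surv) - 1.0

def fgm_survivor_frechet(z, a):
    """P[Z1 > z, Z2 > z] for FGM dependence and standard Frechet margins."""
    t = -math.expm1(-1.0 / z)  # 1 - exp(-1/z), computed without cancellation
    u = 1.0 - t
    return t * t * (1.0 + a * u * u)

def fgm_expansion(z, a):
    """Three-term expansion of the joint survivor function as z grows."""
    return (a + 1.0) / z**2 - (3.0 * a + 1.0) / z**3 + (55.0 * a + 7.0) / (12.0 * z**4)
```

For instance, `fgm_chibar(1e-12, -1.0)` is already close to $-1/3$, while for any $\alpha > -1$ the value drifts slowly towards 0 as `t` decreases, in line with $\eta = 1/3$ and $\eta = 1/2$ respectively.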


Data example 

For the Loss-ALAE data of Figure 9.1, an informal picture of the dependence functions $\chi(u)$ and $\bar\chi(u)$ can be created simply by plugging in empirical estimates

$$\hat C(u, u) = \frac{1}{n} \sum_{i=1}^n 1\{u_{i1} \le u,\ u_{i2} \le u\},$$

$$\hat{\bar C}(u, u) = \frac{1}{n} \sum_{i=1}^n 1\{u_{i1} > u,\ u_{i2} > u\}$$

in expressions (9.77) and (9.81) (Coles et al. 1999). Analyzing the behaviour of the empirical versions of $\chi(u)$ and $\bar\chi(u)$ as $u$ tends to 1 can give an idea of the form of extremal dependence between the variables. Figure 9.9 shows estimates and 95% point-wise confidence intervals for $\chi(u)$ and $\bar\chi(u)$. The confidence intervals are based on bootstrap samples obtained by sampling with replacement from the original data $(x_{i1}, x_{i2})$, $i = 1, \ldots, n$, as suggested in Fermanian et al. (2004). Also shown are the cases of perfect positive dependence, exact independence and perfect negative dependence.

As $\hat\chi(u) > 0$ for $u < 1$, there is evidence for dependence of the variables at lower quantile levels. It appears that $\hat\chi(u) \approx 0.4$ for all $u$, even for $u$ close to 1, suggesting an asymptotically dependent distribution that is possibly of the bivariate extreme value type. However, notice that for $u$ close to 1, the point-wise confidence intervals cover a large range of possible limits, including 0. Moreover, $\bar\chi(u)$ seems to be smaller than 1, which is in contradiction with the hypothesis of an asymptotically dependent distribution. As a result, on the basis of the above informal analysis only, it is difficult to make a decision between asymptotic dependence and asymptotic independence for the insurance data. This shows the need for more formal diagnostics.

9.5.2 Estimating the coefficient of tail dependence 

The coefficient of tail dependence, $\eta$, was found to be most useful in distinguishing between asymptotic dependence and asymptotic independence, and, within the latter class, between positive association, near independence, and negative association. This makes the problem of estimating $\eta$, the topic of this section, a particularly relevant one.


Hill estimator and maximum likelihood estimator 

Assumption (9.85) entails that the univariate variable $T = \min(Z_1, Z_2)$ has a regularly varying tail with index $-1/\eta$; here, $Z_j$ can be either $-1/\log F_j(X_j)$ (standard Fréchet margins) or $1/\{1 - F_j(X_j)\}$ (standard Pareto margins). Therefore, $\eta$ and hence $\bar\chi = 2\eta - 1$ can be estimated as the tail index of $T$, for instance, by the univariate techniques in Chapters 4-5. Notice also that, subject to convergence of $\mathcal{L}(z)$ as $z \to \infty$ and $\eta$ being equal to 1, the dependence parameter $\chi$ can be estimated as the scale parameter of $T$ for large values of $z$, as in that case $\mathcal{L}(z)$ is approximately constant and equal to $\chi$.

Given a sample of independent observations $(X_{i1}, X_{i2})$, $i = 1, \ldots, n$, Ledford and Tawn (1996) propose transforming the data to have approximate standard Fréchet margins by

$$Z_{ij} = -1/\log \hat F_j(X_{ij}), \qquad i = 1, \ldots, n,\ j = 1, 2, \tag{9.89}$$

with $\hat F_j$ estimates of the marginal distribution functions $F_j$, typically by empirical marginal distribution functions and incorporating extreme value estimators for the marginal tails. Alternatively, we may transform to standard Pareto margins by $Z_{ij} = 1/\{1 - \hat F_j(X_{ij})\}$. In any case, the $T_i = \min(Z_{i1}, Z_{i2})$, $i = 1, \ldots, n$, approximately form an independent sample distributed like $T$. Denote the order statistics of the $T_i$ by $T_{1,n} \le \cdots \le T_{n,n}$.

We can use the $T_i$ to estimate $\eta$, for example, by the Hill (1975) estimator (see section 4.2)

$$\hat\eta = \frac{1}{k} \sum_{i=1}^k \log T_{n-k+i,n} - \log T_{n-k,n} \tag{9.90}$$

or by the maximum likelihood estimator in a peaks-over-threshold setting, where exceedances of $T$ above a high-enough threshold $u$ are assumed to follow a GP distribution

$$P[T > u + z \mid T > u] = (1 + \eta z/\sigma)^{-1/\eta}, \qquad 0 \le z < \infty, \tag{9.91}$$

with shape parameter $0 < \eta \le 1$ and scale parameter $\sigma = \sigma(u) > 0$.
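The Hill-type estimate of $\eta$ can be sketched in a few lines. The version below is an illustration with hypothetical function names; it assumes a purely rank-based transformation to approximate standard Pareto margins, $\hat F_j = \text{rank}/(n+1)$, without any parametric tail model for the margins.

```python
import math

def pareto_scores(xs):
    """Rank-transform a sample to approximate standard Pareto margins."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = 1.0 / (1.0 - rank / (n + 1.0))
    return z

def hill_eta(x1, x2, k):
    """Hill estimator of the coefficient of tail dependence from the k largest T_i."""
    z1, z2 = pareto_scores(x1), pareto_scores(x2)
    t = sorted(min(a, b) for a, b in zip(z1, z2))
    n = len(t)
    log_threshold = math.log(t[n - k - 1])  # log T_{n-k,n}
    return sum(math.log(t[n - k - 1 + i]) - log_threshold for i in range(1, k + 1)) / k
```

For perfectly dependent data the estimate is close to 1, and for independent components it should be near 1/2. In practice one would plot the estimate against $k$ to judge its stability.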

Under the model (9.91), Ledford and Tawn (1996) suggest to test for asymptotic independence ($\chi = 0$) by testing $\eta = 1$ against the alternative $0 < \eta < 1$. As mentioned before, observe that under model (9.85), the hypothesis $\eta = 1$ is implied by, but is not equivalent to, asymptotic dependence. Note, however, that the special case $\eta = 1$ and $\mathcal{L}(z) \to 0$ as $z \to \infty$ tends only to have theoretical value, so that, in practice, it is safe to assume that $\eta = 1$ is equivalent to asymptotic dependence.

So, let $L_1$ be the maximized likelihood for (9.91) for a given threshold $u$ and $L_0$ be the corresponding maximized likelihood under the restriction $\eta = 1$. Since the null hypothesis corresponds to a boundary value of the parameter space, the likelihood ratio statistic $D = 2(\log L_1 - \log L_0)$ should be compared to a one-half chi-squared distribution with one degree of freedom, that is, an equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom (Self and Liang 1987), resulting in the p-value $P[\chi^2_1 > D]/2$. Still, it is likely that the true estimation uncertainty is larger than the one reflected in the likelihood ratio tests or profile likelihood-based confidence intervals as we falsely assumed that the $T_i$ are independent, that each marginal distribution is estimated exactly and that the parametric specification of model (9.85) is correct.
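The boundary-corrected p-value is easy to compute without special libraries, since $P[\chi^2_1 > D] = P[|N| > \sqrt{D}] = \operatorname{erfc}(\sqrt{D/2})$ for a standard normal $N$. A minimal sketch, with a hypothetical function name and taking the two maximized log-likelihoods as given:

```python
import math

def boundary_lrt_pvalue(loglik_full, loglik_null):
    """p-value P[chi2_1 > D] / 2 for D = 2 (log L1 - log L0), intended for D > 0."""
    d = max(2.0 * (loglik_full - loglik_null), 0.0)  # guard against small negative values
    return 0.5 * math.erfc(math.sqrt(d / 2.0))
```

Compared with the usual chi-squared reference distribution, the halving makes the test reject $\eta = 1$ at smaller values of $D$: at the 5% level the critical value is about 2.71 rather than 3.84.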


Estimators of Peng (1999) and Draisma et al. (2004)

The above procedure does not account for the uncertainty introduced by the marginal transformations and may therefore understate the true uncertainty in the estimates of $\eta$. To avoid this, Peng (1999) and Draisma et al. (2004) propose estimating $\eta$ in (9.85) through certain non-parametric alternatives that do not depend on the marginal distributions.

Assumption (9.85) formulated for standard Pareto margins $Z_j = 1/\{1 - F_j(X_j)\}$ implies

$$\lim_{t \downarrow 0} \frac{P[X_1 > F_1^{\leftarrow}(1 - ts),\ X_2 > F_2^{\leftarrow}(1 - ts)]}{P[X_1 > F_1^{\leftarrow}(1 - t),\ X_2 > F_2^{\leftarrow}(1 - t)]} = s^{1/\eta} \tag{9.92}$$

for $s > 0$. In both Peng (1999) and Draisma et al. (2004), this limiting relation is used to construct a non-parametric estimator for $\eta$ on the basis of the empirical distribution function of the original observations $(X_{i1}, X_{i2})$, $i = 1, \ldots, n$. With $X_{(i,n),j}$ the $i$th ascending order statistic of the $j$th coordinate sample $(X_{ij})_{i=1}^n$ ($j = 1, 2$), define

$$S_n(k) = \sum_{i=1}^n 1\bigl(X_{i1} > X_{(n-k,n),1},\ X_{i2} > X_{(n-k,n),2}\bigr) \tag{9.93}$$

for $k = 0, \ldots, n - 1$. Notice that $S_n(k)$ depends on the data through their ranks only and that $S_n(k)/n$ can be seen as the empirical counterpart of $P[X_1 > F_1^{\leftarrow}(1 - k/n),\ X_2 > F_2^{\leftarrow}(1 - k/n)]$.

Taking logarithms on both sides of (9.92) with s = 2 leads quite naturally to 
the estimator 


$$\hat{\eta}_1 = \frac{\log 2}{\log\{S_n(2k)/S_n(k)\}} \tag{9.94}$$


proposed by Peng (1999). Integrating both sides of (9.92) with respect to s from 
0 to 1 gives the estimator 


$$\hat{\eta}_2 = \frac{\sum_{j=1}^{k} S_n(j)}{k\,S_n(k) - \sum_{j=1}^{k} S_n(j)} \tag{9.95}$$

as introduced by Draisma et al. (2004). Note that $\hat{\eta}_1$ is based on $S_n(k)$ and $S_n(2k)$,
while $\hat{\eta}_2$ is constructed from the $S_n(j)$ for $j$ only up to $k$.
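
The two rank-based estimators can be sketched in a few lines. The code below is a minimal illustration, assuming no ties in either coordinate sample; the function names are hypothetical and not taken from Peng (1999) or Draisma et al. (2004). It uses the fact that $X_{ij} > X_{(n-k,n),j}$ holds exactly when the ascending rank of $X_{ij}$ exceeds $n - k$.

```python
import numpy as np

def joint_exceedance_counts(x1, x2, kmax):
    """Return the array [S_n(0), ..., S_n(kmax)] of (9.93), assuming no ties."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1)
    # Ascending ranks 1..n within each coordinate sample.
    r1 = np.argsort(np.argsort(x1)) + 1
    r2 = np.argsort(np.argsort(x2)) + 1
    # Both ranks exceed n - k exactly when the smaller of the two does.
    smallest = np.minimum(r1, r2)
    return np.array([np.sum(smallest > n - k) for k in range(kmax + 1)])

def eta_peng(x1, x2, k):
    """Estimator (9.94): log 2 / log{S_n(2k) / S_n(k)}; needs S_n(2k) > S_n(k) > 0."""
    s = joint_exceedance_counts(x1, x2, 2 * k)
    return np.log(2.0) / np.log(s[2 * k] / s[k])

def eta_draisma(x1, x2, k):
    """Estimator (9.95): sum_j S_n(j) / {k S_n(k) - sum_j S_n(j)}, j = 1..k."""
    s = joint_exceedance_counts(x1, x2, k)
    total = s[1:k + 1].sum()
    return total / (k * s[k] - total)
```

For perfectly comonotone data one has $S_n(k) = k$, so the first estimator returns exactly 1, while the second returns $(k+1)/(k-1)$ because of the discreteness of the sum.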

Peng (1999) and Draisma et al. (2004) establish asymptotic normality of their
estimators under certain second-order conditions on the limiting behaviour of
$P[Z_1 > x_1 z,\ Z_2 > x_2 z]$ for $0 < x_j < \infty$ as $z \to \infty$. The second-order conditions
by Peng (1999) prohibit the slowly varying function $\mathcal{L}(z)$ in (9.85) to converge
to zero as $z \to \infty$, so that the hypothesis of asymptotic independence ($\chi = 0$)
is equivalent to $\eta < 1$. A drawback is that distributions such as the bivariate
normal (Example 9.3) are excluded. The second-order conditions by Draisma et
al. (2004) are less restrictive in that they do allow for $\mathcal{L}(z) \to 0$ as $z \to \infty$.

Draisma et al. (2004) prove asymptotic normality not only for their estimator
$\hat{\eta}_2$ (9.95) but also for the estimator $\hat{\eta}_1$ (9.94) by Peng (1999), the Hill estimator $\hat{\eta}_3$
(9.90), and the maximum likelihood estimator $\hat{\eta}_4$ arising from (9.91) with threshold
$u = T_{n-k,n}$. They transform the data to standard Pareto margins by

$$Z_{ij} = \frac{1}{1 - \hat{F}_j(X_{ij})} \quad \text{where} \quad \hat{F}_j(x) = \frac{1}{n+1} \sum_{i=1}^{n} 1(X_{ij} \le x). \tag{9.96}$$

Observe that the $Z_{ij}$ depend on the original data $X_{ij}$ through the ranks only.
Draisma et al. (2004) show that under certain growth conditions for $k = k_n$, the
standardized estimators $\{S_n(k)\}^{1/2}(\hat{\eta}_i - \eta)$ ($i = 1, 2$) and $k^{1/2}(\hat{\eta}_i - \eta)$ ($i = 3, 4$)
are asymptotically normal with mean 0 and certain asymptotic variances $\sigma_i^2$
($i = 1, \ldots, 4$). The expressions for the $\sigma_i^2$ are rather complicated, but Draisma et al.
(2004) propose a way to estimate them from the data as well. Denoting such estimates
by $\hat{\sigma}_i^2$, asymptotic confidence intervals for $\eta$ can easily be constructed, and
$\eta = 1$ can be tested against the alternative $\eta < 1$. For instance, denoting by $\hat{\sigma}_{(i)}$
the estimated root variances $\{S_n(k)\}^{-1/2}\hat{\sigma}_i$ ($i = 1, 2$) and $k^{-1/2}\hat{\sigma}_i$ ($i = 3, 4$) in
case $\eta = 1$, we can reject $\eta = 1$ in favour of $\eta < 1$ if $\hat{\eta}_i < 1 - z_\alpha \hat{\sigma}_{(i)}$, with $z_\alpha$ the
$(1 - \alpha)$-quantile of the standard normal distribution.
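
The rank transformation (9.96) and a Hill-type estimate on the minima can be sketched as follows. This is a hedged illustration only: it assumes no ties, and it assumes that (9.90), which is not reproduced in this section, is the usual Hill estimator applied to $T_i = \min(Z_{i1}, Z_{i2})$ with threshold $T_{n-k,n}$. The function names are hypothetical.

```python
import numpy as np

def to_standard_pareto(x):
    """Rank transform (9.96): Z_i = 1 / {1 - rank_i / (n + 1)}, assuming no ties."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1      # F_hat(X_i) = rank_i / (n + 1)
    return 1.0 / (1.0 - ranks / (n + 1.0))

def hill_eta(x1, x2, k):
    """Hill estimate of eta from the k largest minima T_i (assumed form of (9.90))."""
    t = np.sort(np.minimum(to_standard_pareto(x1), to_standard_pareto(x2)))
    n = len(t)
    # Average log-excess of the k largest T values over the threshold T_(n-k,n).
    return np.mean(np.log(t[n - k:]) - np.log(t[n - k - 1]))
```

Because only ranks enter, any strictly increasing transformation of either margin leaves the estimate unchanged.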
Estimator of Beirlant and Vandewalle (2002)


The bias of estimators as introduced by Ledford and Tawn (1996), Peng (1999)
and Draisma et al. (2004) depends heavily on the underlying bivariate distribution.
If, for instance, the dependence parameter $\alpha$ in the logistic model of Example 9.1
is close to, but smaller than, one, the first-order terms in (9.87) and (9.88) will
be small and dominated by the second-order terms unless $z$ is large (note that the
situation is worse for Pareto margins (9.88) than for Fréchet margins (9.87), as the
second-order terms are smaller in the latter than in the former, although for other
distributions the situation may be reversed). As a result, thresholds will need to
be chosen high enough in order not to get estimates of $\eta$ that are biased towards
asymptotic independence. In view of the critical difference between asymptotic
dependence and asymptotic independence regarding out-of-sample inference, it is
therefore highly desirable to have available estimation methods for $\eta$ that can
cope with a slow rate of convergence in the model assumption $\mathcal{L}(xz)/\mathcal{L}(z) \to 1$
as $z \to \infty$.

In this respect, Beirlant and Vandewalle (2002) suggest an estimator based on
scaled log-ratios

$$Y_j = j \log \frac{T_{n-j+1,n} - T_{n-k,n}}{T_{n-j,n} - T_{n-k,n}}, \qquad j = 1, \ldots, k - 1,$$

of excesses over a large threshold $T_{n-k,n}$. Here, the $T_i$ are defined as above,
constructed from the data transformed to either standard Pareto or standard Fréchet
margins. The coefficient of tail dependence, $\eta$, is then estimated by maximum
likelihood from the exponential regression model


Yj 


V 


r] 


1 - G7 印 


Ej, 


, k — 1, 


(9.97) 


with $E_j$ independent standard exponential random variables, see section 5.4.

Beirlant and Vandewalle (2002) prove asymptotic normality of this estimator
under the same second-order conditions as in Draisma et al. (2004) but restricted to
the case of asymptotic independence ($\chi = 0$). The estimator has a smaller bias than
other well-known estimators, whatever the marginal transformations and underlying
distribution. Under a second-order refinement of (9.97), minimization of the
estimated asymptotic mean squared error leads to a diagnostic for selecting the
optimal $k$ to be used in estimating $\eta$.


Data example 

We transform the Loss-ALAE data of Figure 9.1 to approximate standard Fréchet
margins by

$$Z_{ij} = -1/\log U_{ij}, \qquad i = 1, \ldots, n, \quad j = 1, 2,$$

with $U_{ij}$ as in (9.1). The sample of minima, $T_i = \min(Z_{i1}, Z_{i2})$, serves to construct
maximum likelihood estimates and profile likelihood confidence intervals for $\eta$
based on the GP likelihood derived from (9.91), see Figure 9.10(a). Here, the
threshold varies along the entire range of threshold probabilities for $T$. Transforming
to standard Pareto margins by $Z_{ij} = 1/(1 - U_{ij})$ gives the same qualitative
results, see Figure 9.10(b).

In view of the probable underestimation of the estimation uncertainty as reflected
in the confidence intervals, we notice that although estimates of $\eta$ for almost all
threshold probabilities between 0.5 and 0.9 seem to be close to 0.9, corresponding
to a strongly positively associated form of asymptotic independence, the value
$\eta = 1$, consistent with asymptotic dependence, is still covered by almost all
confidence intervals. Neither do likelihood ratio tests consistently permit us to reject
asymptotic dependence against asymptotic independence.

Alternatively, Figure 9.11 shows point-wise estimates for $\eta$ using the estimator
(9.94) of Peng (1999), together with critical values under which $\eta = 1$ is rejected in
favour of $\eta < 1$ based on a 5% one-sided test (left) and two-sided 95% confidence
intervals (right). Again, we cannot consistently reject asymptotic dependence.

9.5.3 Joint tail modelling 

Modelling the tail of a multivariate distribution by an extreme value distribution
as in section 9.4 limits the options for the extremal dependence structure to either
asymptotic dependence or exact independence. In the bivariate case, this means that
the probability $P[X_1 > F_1^{\leftarrow}(1 - 1/z),\ X_2 > F_2^{\leftarrow}(1 - 1/z)]$ of joint exceedances
of the respective $1/z$ tail quantiles in the two margins is of the order $O(z^{-1})$
(asymptotic dependence) or $O(z^{-2})$ (exact independence) as $z \to \infty$. However, for
asymptotically independent distributions with positive association, that is,
$1/2 < \eta < 1$ in (9.85), this probability is in fact of the order $z^{-1/\eta}$, up to a
slowly varying factor. Hence, for such
distributions, the probability of joint extremes will be evaluated either too large,
in case of an asymptotically dependent model, or too small, in case of the exactly
independent model.


The model of Ledford and Tawn (1997) 

A versatile model bridging the gap between asymptotic dependence and exact 
independence was introduced by Ledford and Tawn (1997). Before we can describe 
their model, we need some technical preliminaries. 

A function $\mathcal{L} : (0, \infty)^2 \to (0, \infty)$ is called bivariate slowly varying if there
exists a function $g : (0, \infty)^2 \to (0, \infty)$ such that

$$\lim_{t \to \infty} \frac{\mathcal{L}(tz_1, tz_2)}{\mathcal{L}(t, t)} = g(z_1, z_2), \qquad 0 < z_j < \infty \ (j = 1, 2) \tag{9.98}$$

and if this function $g$ is homogeneous of order zero, that is,

$$g(sz_1, sz_2) = g(z_1, z_2), \qquad 0 < s < \infty, \quad 0 < z_j < \infty \ (j = 1, 2)$$

(Bingham et al. 1987). The homogeneity of $g$ implies that there exists a function
$g_* : (0, 1) \to (0, \infty)$ such that $g(z_1, z_2) = g_*\{z_1/(z_1 + z_2)\}$ for all $0 < z_j < \infty$
($j = 1, 2$). We call $\mathcal{L}$ ray independent if $g_*$ is constant and ray dependent otherwise.
Furthermore, $\mathcal{L}$ is called quasi-symmetric if the function $g_*(w)/g_*(1 - w)$ is slowly
varying at $w \to 0$ and $w \to 1$.

Now as in the previous sections, let $(X_1, X_2)$ be a random pair with distribution
function $F$ and continuous marginal distribution functions $F_1$ and $F_2$. Transform the
vector to standard Fréchet margins by $Z_j = -1/\log F_j(X_j)$ for $j = 1, 2$. Ledford
and Tawn (1997) propose to model the joint survivor function of $(Z_1, Z_2)$ as

$$P[Z_1 > z_1,\ Z_2 > z_2] = \mathcal{L}(z_1, z_2)\, z_1^{-c_1} z_2^{-c_2}, \tag{9.99}$$

with $c_j > 0$ for $j = 1, 2$ and $\mathcal{L}$ a quasi-symmetric, bivariate slowly varying
function. Clearly, (9.99) implies (9.85) with $1/\eta = c_1 + c_2$. In this sense, the model of
Ledford and Tawn (1997) provides an extension of the one of Ledford and Tawn
(1996). All in all, (9.99) provides a smooth family of dependence models,
incorporating asymptotically dependent distributions as well as positively or negatively
associated asymptotically independent distributions.

The quasi-symmetry condition on $\mathcal{L}$ is imposed to identify $c_1$ and $c_2$. For,
denoting $c_2 - c_1 = \kappa$, we also have

$$P[Z_1 > z_1,\ Z_2 > z_2] = \tilde{\mathcal{L}}(z_1, z_2)\, (z_1 z_2)^{-1/(2\eta)} \tag{9.100}$$

for $0 < z_j < \infty$ ($j = 1, 2$), where the function

$$\tilde{\mathcal{L}}(z_1, z_2) = (z_1/z_2)^{\kappa/2}\, \mathcal{L}(z_1, z_2) \tag{9.101}$$

is bivariate slowly varying with limit function $\tilde{g}(z_1, z_2) = (z_1/z_2)^{\kappa/2} g(z_1, z_2)$ and
ray dependence function $\tilde{g}_*(w) = \{w/(1 - w)\}^{\kappa/2} g_*(w)$ for $0 < w < 1$. As
$g_*(w)/g_*(1 - w)$ is slowly varying at 0 and 1, the function
$\tilde{g}_*(w)/\tilde{g}_*(1 - w) = \{w/(1 - w)\}^{\kappa} g_*(w)/g_*(1 - w)$ is regularly
varying at 0 and 1 with indices $\kappa$ and $-\kappa$, respectively. Observe that an alternative
and perhaps simpler and less restrictive way to define the joint tail model is via
(9.100) but without imposing regular variation of $\tilde{g}_*(w)/\tilde{g}_*(1 - w)$ at 0 or 1. This
is in fact the approach taken in Ramos (2003).


Exponent measure 


If the joint tail model (9.99) holds, then there exist proper analogues of the exponent
and spectral measures of a max-stable distribution. The starting point is the simple
observation that

$$\lim_{t \to \infty} \frac{P[Z_1 > tz_1,\ Z_2 > tz_2]}{P[Z_1 > t,\ Z_2 > t]} = g_*\{z_1/(z_1 + z_2)\}\, z_1^{-c_1} z_2^{-c_2}, \tag{9.102}$$

for $0 < z_j < \infty$ ($j = 1, 2$). This shows that $g(z, z) = g_*(1/2) = 1$ for $0 < z < \infty$.
Now for $0 < t < \infty$, define a positive measure $\Lambda_t(\cdot)$ on $(0, \infty)^2$ by

$$\Lambda_t(B) = \frac{P[t^{-1}(Z_1, Z_2) \in B]}{P[Z_1 > t,\ Z_2 > t]}$$
for Borel sets $B$ in $(0, \infty)^2$. Then (9.102) states convergence of
$\Lambda_t\{(z_1, \infty) \times (z_2, \infty)\}$ as $t \to \infty$. Now by similar arguments as those leading to
(8.71) or (8.94), this implies that there exists a positive measure $\Lambda$ on $(0, \infty)^2$ given by

$$\Lambda\{(z_1, \infty) \times (z_2, \infty)\} = g_*\{z_1/(z_1 + z_2)\}\, z_1^{-c_1} z_2^{-c_2} \tag{9.103}$$

for $0 < z_j < \infty$ ($j = 1, 2$) and such that

$$\Lambda_t(\cdot) \stackrel{v}{\to} \Lambda(\cdot) \quad \text{as } t \to \infty \text{ in } (0, \infty] \times (0, \infty], \tag{9.104}$$

with '$\stackrel{v}{\to}$' denoting vague convergence (Kallenberg 1983; Resnick 1987). Observe
that in (9.104), the coordinate axes are excluded, in contrast to (8.94). The reason is
that in case $\eta < 1$, the normalizing factor $1/P[Z_1 > t,\ Z_2 > t] = \{\mathcal{L}(t, t)\}^{-1} t^{1/\eta}$
in the definition of $\Lambda_t$ is of larger order than the factor $t$ used in the corresponding
definition in (8.94). In other words, for sets $B$ hugging one or both of the axes, $\Lambda_t(B)$ may blow
up to infinity if $\eta < 1$. Finally, observe that (9.104) suggests the approximation

$$P[(Z_1, Z_2) \in \cdot\,] \approx P[Z_1 > t,\ Z_2 > t]\, \Lambda(t^{-1}\,\cdot) \tag{9.105}$$

for large enough $0 < t < \infty$. This approximation forms the basis of statistical
inference procedures on the bivariate tail of $(Z_1, Z_2)$, see below.

Clearly, equation (9.103) implies that

$$\Lambda\{(sz_1, \infty) \times (sz_2, \infty)\} = s^{-1/\eta}\, \Lambda\{(z_1, \infty) \times (z_2, \infty)\},$$

for $0 < s < \infty$ and $0 < z_j < \infty$ ($j = 1, 2$). Since rectangles of the kind
$(z_1, \infty) \times (z_2, \infty)$ form a measure-determining class in $(0, \infty)^2$, we obtain

$$\Lambda(s\,\cdot) = s^{-1/\eta}\, \Lambda(\cdot), \qquad 0 < s < \infty. \tag{9.106}$$

Property (9.106) should be compared with the corresponding homogeneity property
(8.11) of the exponent measure $\mu_*$.


Spectral measure 


Define the measure $H_\Lambda$ on $(0, 1)$ by

$$H_\Lambda(B) = \Lambda\Bigl(\Bigl\{(z_1, z_2) \in (0, \infty)^2 : z_1 + z_2 > 1,\ \frac{z_1}{z_1 + z_2} \in B\Bigr\}\Bigr) \tag{9.107}$$

for Borel sets $B$ in $(0, 1)$. By homogeneity in (9.106),

$$\Lambda\Bigl(\Bigl\{(z_1, z_2) \in (0, \infty)^2 : z_1 + z_2 > r,\ \frac{z_1}{z_1 + z_2} \in B\Bigr\}\Bigr) = r^{-1/\eta} H_\Lambda(B) \tag{9.108}$$

for $0 < r < \infty$ and Borel sets $B$ in $(0, 1)$. Equation (9.108) implies that the measure
$\Lambda$ factorizes as a product measure when expressed in pseudo-polar coordinates
$T(z_1, z_2) = (r, w)$ with $r = z_1 + z_2$ and $w = z_1/(z_1 + z_2)$, that is,

$$\Lambda \circ T^{-1}(dr\, dw) = \eta^{-1} r^{-1/\eta - 1}\, dr\, H_\Lambda(dw). \tag{9.109}$$
This is the spectral decomposition of $\Lambda$, to be compared with the spectral
decomposition for the exponent measure $\mu_*$ in (8.17). The measure $H_\Lambda$ is the spectral
measure of $\Lambda$.

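The spectral decomposition (9.109) implies for $0 < z_j < \infty$ ($j = 1, 2$),

$$\begin{aligned}
\Lambda\{(z_1, \infty) \times (z_2, \infty)\}
&= \int_{(0,1)} \int_0^\infty 1\{rw > z_1,\ r(1 - w) > z_2\}\, \eta^{-1} r^{-1/\eta - 1}\, dr\, H_\Lambda(dw) \\
&= \int_{(0,1)} \Bigl\{\min\Bigl(\frac{w}{z_1}, \frac{1 - w}{z_2}\Bigr)\Bigr\}^{1/\eta} H_\Lambda(dw).
\end{aligned} \tag{9.110}$$

Comparing this with (9.103) gives

$$g_*(w)\, w^{-c_1} (1 - w)^{-c_2} = \int_{(0,1)} \Bigl\{\min\Bigl(\frac{v}{w}, \frac{1 - v}{1 - w}\Bigr)\Bigr\}^{1/\eta} H_\Lambda(dv), \tag{9.111}$$

for $0 < w < 1$. As $g_*(1/2) = 1$, we find that the spectral measure $H_\Lambda$ must satisfy
the constraint

$$\int_{(0,1)} \{\min(w, 1 - w)\}^{1/\eta}\, H_\Lambda(dw) = 1. \tag{9.112}$$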


If $\Lambda$ is absolutely continuous with density $\lambda(z_1, z_2)$, then $H_\Lambda$ is absolutely
continuous as well, and its density $h_\Lambda$ can be calculated from $g_*$ and $(c_1, c_2)$ as
follows. The Jacobian of the transformation $(z_1, z_2) \mapsto (r, w) = (z_1 + z_2,\ z_1/(z_1 + z_2))$
is equal to $r^{-1}$. Therefore, by the multivariate change-of-variables formula,

$$\lambda(z_1, z_2) = \eta^{-1} r^{-2 - 1/\eta}\, h_\Lambda(w).$$

Since moreover $\lambda(z_1, z_2) = \partial^2 \Lambda\{(z_1, \infty) \times (z_2, \infty)\}/\partial z_1 \partial z_2$, we obtain

$$h_\Lambda(w) = \frac{c_1 c_2\, g_*(w) + w(1 - w)\, g_*'(w)\, (2w - 1 + c_1 - c_2) - g_*''(w)\, w^2 (1 - w)^2}{(c_1 + c_2)\, w^{1 + c_1} (1 - w)^{1 + c_2}} \tag{9.113}$$

for $0 < w < 1$. This derivation shows that when specifying parametric models for
$\mathcal{L}$ in (9.99) and hence for $g_*$, care has to be taken that the resulting spectral density
$h_\Lambda$ is indeed positive.
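
The positivity check on (9.113) is easy to automate on a grid. The sketch below evaluates the spectral density for a user-supplied ray dependence function and its first two derivatives; the function names and the example candidate $g_*(w) = 1 + 16(w - 1/2)^2$ are purely illustrative assumptions, chosen only to show a case in which the density becomes negative.

```python
def spectral_density(w, c1, c2, g, dg, d2g):
    """Evaluate h_Lambda(w) from (9.113) given g_*, g_*' and g_*'' as callables."""
    numerator = (c1 * c2 * g(w)
                 + w * (1.0 - w) * dg(w) * (2.0 * w - 1.0 + c1 - c2)
                 - d2g(w) * w ** 2 * (1.0 - w) ** 2)
    denominator = (c1 + c2) * w ** (1.0 + c1) * (1.0 - w) ** (1.0 + c2)
    return numerator / denominator

def is_nonnegative_on_grid(c1, c2, g, dg, d2g, m=999):
    """Check h_Lambda(w) >= 0 at the interior grid points w = i/(m+1), i = 1..m."""
    return all(
        spectral_density(i / (m + 1.0), c1, c2, g, dg, d2g) >= 0.0
        for i in range(1, m + 1)
    )

# Ray independent case g_* = 1: always valid.
valid = is_nonnegative_on_grid(1.0, 1.0, lambda w: 1.0, lambda w: 0.0, lambda w: 0.0)

# Hypothetical candidate with too much curvature at w = 1/2: not a valid model.
invalid = is_nonnegative_on_grid(
    1.0, 1.0,
    lambda w: 1.0 + 16.0 * (w - 0.5) ** 2,
    lambda w: 32.0 * (w - 0.5),
    lambda w: 32.0,
)
```

A grid check of this kind can only reveal violations; it does not prove non-negativity between grid points.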

A simpler way to specify parametric models satisfying (9.99) is directly via
the spectral measure (Ramos 2003). If $0 < \eta \le 1$ and if $H$ is a positive measure
on $(0, 1)$ satisfying the constraint (9.112), then we can define a probability distribution
with joint survivor function as in the right-hand side of (9.110) restricted to
$1 \le z_j < \infty$ ($j = 1, 2$). This survivor function can be written as in (9.100) with
bivariate slowly varying function

$$\tilde{\mathcal{L}}_{\eta,H}(z_1, z_2) = \int_{(0,1)} \Bigl\{\min\Bigl(\frac{z_2^{1/2}\, w}{z_1^{1/2}},\ \frac{z_1^{1/2} (1 - w)}{z_2^{1/2}}\Bigr)\Bigr\}^{1/\eta} H(dw)$$

for $1 \le z_j < \infty$ ($j = 1, 2$), whose limit function, $\tilde{g}_{\eta,H}$, is equal to the expression in
the right-hand side of the above equation extended to all $0 < z_j < \infty$ ($j = 1, 2$). If,
moreover, the corresponding ray dependence function $\tilde{g}_{*;\eta,H}$ is regularly varying
at 0 and 1 with indices $\kappa$ and $-\kappa$, then we can define $\mathcal{L}_{\eta,H}$ by turning around
(9.101), leading finally to the representation (9.100).

By (9.104), the spectral measure $H_\Lambda$ can be related to the distribution of
$Z_1/(Z_1 + Z_2)$ given that both $Z_1$ and $Z_2$ are large in the sense that

$$\frac{P[Z_1 + Z_2 > t,\ Z_1/(Z_1 + Z_2) \in \cdot\,]}{P[Z_1 > t,\ Z_2 > t]} \stackrel{v}{\to} H_\Lambda(\cdot), \qquad t \to \infty, \tag{9.114}$$

in the open interval $(0, 1)$. In particular, if $H_\Lambda$ is absolutely continuous with spectral
density $h_\Lambda$, then

$$\lim_{t \to \infty} \frac{P[Z_1 + Z_2 > t,\ w_0 < Z_1/(Z_1 + Z_2) < w_1]}{P[Z_1 > t,\ Z_2 > t]} = \int_{w_0}^{w_1} h_\Lambda(w)\, dw \tag{9.115}$$

for all $0 < w_0 < w_1 < 1$. Observe that we do not allow the $w_j$ to be 0 or 1, as in
case $\eta < 1$ the limit would be infinity.


Example 9.4 If $\mathcal{L}$ is ray independent, that is, if $g_* \equiv 1$, then the exponent measure
is $\Lambda\{(z_1, \infty) \times (z_2, \infty)\} = z_1^{-c_1} z_2^{-c_2}$, while by (9.113), the spectral density is given
simply by

$$h(w; c_1, c_2) = \frac{c_1 c_2}{c_1 + c_2}\, \frac{1}{w^{1 + c_1} (1 - w)^{1 + c_2}}, \qquad 0 < w < 1.$$

If, moreover, $c_1 = c_2$, then, as $c_1 + c_2 = 1/\eta$,

$$h(w; \eta) = \frac{1}{4\eta\, \{w(1 - w)\}^{1 + 1/(2\eta)}}, \qquad 0 < w < 1. \tag{9.116}$$

It is not hard to check (9.112) directly for $h(w; c_1, c_2)$.
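
The constraint (9.112) for the ray independent density of Example 9.4 can also be confirmed numerically. The sketch below uses a plain midpoint rule; the function names and the grid size are illustrative assumptions. Exponents $c_j \ge 1$ are used so that the integrand stays bounded near the endpoints.

```python
def h_ray_independent(w, c1, c2):
    """Spectral density of Example 9.4 (g_* identically equal to one)."""
    return c1 * c2 / ((c1 + c2) * w ** (1.0 + c1) * (1.0 - w) ** (1.0 + c2))

def constraint_integral(c1, c2, m=100_000):
    """Midpoint-rule value of the integral in (9.112), with 1/eta = c1 + c2."""
    inv_eta = c1 + c2
    step = 1.0 / m
    total = 0.0
    for i in range(m):
        w = (i + 0.5) * step          # midpoint of the i-th cell
        total += min(w, 1.0 - w) ** inv_eta * h_ray_independent(w, c1, c2)
    return total * step

value_symmetric = constraint_integral(1.0, 1.0)     # eta = 1/2, as in Example 9.5
value_asymmetric = constraint_integral(1.5, 2.0)
```

Both values should be close to one, in line with the exact calculation: substituting $u = w/(1 - w)$ on $(0, 1/2)$ and $u = (1 - w)/w$ on $(1/2, 1)$ gives $\{c_1 c_2/(c_1 + c_2)\}(1/c_2 + 1/c_1) = 1$.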


Example 9.5 If $Z_1$ and $Z_2$ are independent standard Fréchet random variables,
then

$$P[Z_1 > z_1,\ Z_2 > z_2] = \{1 - \exp(-1/z_1)\}\{1 - \exp(-1/z_2)\} = \mathcal{L}(z_1, z_2)\, (z_1 z_2)^{-1},$$

where $\mathcal{L}(z_1, z_2)$ is a ray independent, bivariate slowly varying function. In
particular, $c_1 = c_2 = 1$ and $\eta = 1/2$. By (9.116), the spectral density is given by
$h(w; 1/2) = 2^{-1} \{w(1 - w)\}^{-2}$.

Example 9.6 Let the random pair $(X_1, X_2)$ have a bivariate normal distribution
with standard normal margins and correlation $-1 < \rho < 1$. Transform the margins
to the standard Fréchet distribution by $Z_j = -1/\log \Phi(X_j)$ for $j = 1, 2$, where $\Phi$
is the standard normal distribution function. Ledford and Tawn (1997) show that
the bivariate survivor function of $(Z_1, Z_2)$ can be written as

$$P[Z_1 > z_1,\ Z_2 > z_2] = \mathcal{L}(z_1, z_2; \rho)\, (z_1 z_2)^{-1/(1 + \rho)},$$

with $\mathcal{L}(z_1, z_2; \rho)$ a ray independent, bivariate slowly varying function. Hence
$c_1 = c_2 = 1/(1 + \rho)$, $\eta = (1 + \rho)/2$, and the spectral density is given by
$h\{w; (1 + \rho)/2\}$ as in (9.116).

Example 9.7 Let $(Z_1, Z_2)$ have a bivariate extreme value distribution with standard
Fréchet margins and exponent measure $\mu_*$, that is, $P[Z_1 \le z_1,\ Z_2 \le z_2] =
\exp\{-V_*(z_1, z_2)\}$ with $V_*(z_1, z_2) = \mu_*\{[0, \infty)^2 \setminus [0, z_1] \times [0, z_2]\}$ for $0 < z_j < \infty$
($j = 1, 2$), see (8.8). The joint survivor function of $(Z_1, Z_2)$ is given by

$$P[Z_1 > z_1,\ Z_2 > z_2] = 1 - \exp(-1/z_1) - \exp(-1/z_2) + \exp\{-V_*(z_1, z_2)\}.$$

Assume that $Z_1$ and $Z_2$ are not independent, that is, $\chi = 2 - V_*(1, 1) > 0$.
Recalling from (8.11) that $V_*$ is homogeneous of order $-1$, we obtain

$$P[Z_1 > t,\ Z_2 > t] \sim \chi\, t^{-1}, \qquad t \to \infty,$$

and hence, for $0 < z_j < \infty$ ($j = 1, 2$),

$$\lim_{t \to \infty} \frac{P[Z_1 > z_1 t,\ Z_2 > z_2 t]}{P[Z_1 > t,\ Z_2 > t]} = \chi^{-1} \{z_1^{-1} + z_2^{-1} - V_*(z_1, z_2)\} = \chi^{-1} \mu_*\{(z_1, \infty) \times (z_2, \infty)\}. \tag{9.117}$$

Therefore

$$P[Z_1 > z_1,\ Z_2 > z_2] = \mathcal{L}(z_1, z_2)\, (z_1 z_2)^{-1/2},$$

where the function $\mathcal{L}(z_1, z_2) = (z_1 z_2)^{1/2} P[Z_1 > z_1,\ Z_2 > z_2]$ is bivariate slowly
varying with limit function

$$g(z_1, z_2) = \chi^{-1} (z_1 z_2)^{1/2}\, \mu_*\{(z_1, \infty) \times (z_2, \infty)\}.$$

The function $g$ is homogeneous of order zero, and $g(z_1, z_2) = g_*\{z_1/(z_1 + z_2)\}$ with

$$g_*(w) = \chi^{-1} \{w(1 - w)\}^{1/2}\, \mu_*\{(w, \infty) \times (1 - w, \infty)\} = \frac{1 - A(w)}{\chi\, \{w(1 - w)\}^{1/2}}, \qquad 0 < w < 1,$$

where $A(w) = V_*\{(1 - w)^{-1}, w^{-1}\}$ is the Pickands dependence function. Denoting the
spectral measure of $\mu_*$ by $H$ as in (8.28), we have by (8.48), as $w \to 0$,

$$\frac{g_*(w)}{g_*(1 - w)} = \frac{1 - A(w)}{1 - A(1 - w)} \to \frac{-A'(0)}{A'(1)} = \frac{H((0, 1])}{H([0, 1))},$$

confirming that $g_*(w)/g_*(1 - w)$ is slowly varying at 0 and 1, that is, $\mathcal{L}$ is
quasi-symmetric and $c_1 = c_2 = 1/2$. From (9.117), we obtain that the exponent measure
$\Lambda$ is $\Lambda = \chi^{-1} \mu_*$ with spectral measure $H_\Lambda = \chi^{-1} H$, indeed satisfying (9.112).
In the special case of complete dependence, the spectral measure degenerates to a point
mass of size two at $w = 1/2$.
Point processes


Convergence of point processes as in (8.98) under the domain-of-attraction
condition can be formulated in the joint tail model (9.99) too. Let $(Z_{n1}, Z_{n2})$,
$n = 1, 2, \ldots$ be an independent sequence of random pairs distributed as $(Z_1, Z_2)$ in
(9.99). Define the sequence of point processes

$$N_n(\cdot) = \sum_{i=1}^{n} 1\{t_n^{-1}(Z_{i1}, Z_{i2}) \in \cdot\,\}.$$

Rather than normalizing by $n$ as in (8.98), we normalize here by a sequence $(t_n)_n$
of positive numbers such that $P[Z_1 > t_n,\ Z_2 > t_n] \sim 1/n$ as $n \to \infty$. Since the
function $0 < t \mapsto P[Z_1 > t,\ Z_2 > t] = \mathcal{L}(t, t)\, t^{-1/\eta}$ is regularly varying at infinity
with index $-1/\eta$, we must have $t_n = n^{\eta} \ell(n)$ for some slowly varying function $\ell$
(Bingham et al. 1987). In particular, if $\eta < 1$, then $t_n = o(n)$ as $n \to \infty$.
Since, by (9.104),

$$n\, P[t_n^{-1}(Z_1, Z_2) \in \cdot\,] \stackrel{v}{\to} \Lambda(\cdot), \qquad n \to \infty$$

in $(0, \infty]^2$, Proposition 3.21 of Resnick (1987) implies that

$$N_n \stackrel{d}{\to} N, \qquad n \to \infty, \tag{9.118}$$

where $N$ is a non-homogeneous Poisson process on $(0, \infty]^2$ with intensity measure
$\Lambda$. Note again that we excluded the coordinate axes from the state space. The
reason is that if $\eta < 1$, the normalization by $t_n$ is too weak and can only control
the $(Z_{i1}, Z_{i2})$ for which both coordinates are large (recall that the maximum of
$n$ independent standard Fréchet variables is of order $n$). Therefore, the number of
points in $N_n$ close to the axes will converge to infinity. Normalizing, on the other
hand, by $n$ rather than by $t_n$ would indeed control the points near the axes, but
since the limiting measure in case of asymptotic independence is concentrated on
the axes, there would remain in the limit no points in the interior.

By (9.118), for $0 < z_j < \infty$ ($j = 1, 2$),

$$\begin{aligned}
P[\forall i = 1, \ldots, n : Z_{i1} \le t_n z_1 \text{ or } Z_{i2} \le t_n z_2]
&= P[N_n\{(z_1, \infty) \times (z_2, \infty)\} = 0] \\
&\to \exp[-\Lambda\{(z_1, \infty) \times (z_2, \infty)\}] = \exp\Bigl\{-g_*\Bigl(\frac{z_1}{z_1 + z_2}\Bigr) z_1^{-c_1} z_2^{-c_2}\Bigr\}
\end{aligned}$$

as $n \to \infty$. This relation can also be obtained directly from (9.99). More
interestingly, we can find the limit distribution of the component-wise maximum of the
sub-sample consisting of those pairs $(Z_{i1}, Z_{i2})$, $i = 1, \ldots, n$, that fall in the region
$(t_n, \infty)^2$: for $1 \le z_j < \infty$ ($j = 1, 2$),

$$\begin{aligned}
&P[\max\{(Z_{i1}, Z_{i2}) : i = 1, \ldots, n \text{ with } (Z_{i1}, Z_{i2}) > (t_n, t_n)\} \le (t_n z_1, t_n z_2)] \\
&\quad = P[N_n(\{(1, \infty) \times (1, \infty)\} \setminus \{(1, z_1] \times (1, z_2]\}) = 0] \\
&\quad \to \exp[-\Lambda(\{(1, \infty) \times (1, \infty)\} \setminus \{(1, z_1] \times (1, z_2]\})] \\
&\quad = \exp\{-g(z_1, 1)\, z_1^{-c_1} - g(1, z_2)\, z_2^{-c_2} + g(z_1, z_2)\, z_1^{-c_1} z_2^{-c_2}\}
\end{aligned}$$

as $n \to \infty$.


Statistical inference 

The joint tail model (9.99) may be used for statistical inference on a bivariate
distribution in that region of its support where both components are large. As before,
the analysis splits into inference on the margins and inference on the joint
dependence structure. Estimates $\hat{F}_j$ ($j = 1, 2$) of the marginal distributions $F_j$ are used
to transform the original data $(X_{i1}, X_{i2})$ to approximate standard Fréchet margins
by $Z_{ij} = -1/\log \hat{F}_j(X_{ij})$, and these transformed data are then assumed to follow
the joint tail model (9.99). The margins may be estimated non-parametrically or
semi-parametrically as explained in section 9.4.1. Alternatively, under a parametric
specification of (9.99), marginal and dependence parameters may be estimated
jointly by maximum likelihood. In the text, we assume for simplicity that the
margins are known, so that we dispose of independent, identically distributed random
pairs $(Z_{i1}, Z_{i2})$ following the joint tail model (9.99).

We will not apply the methods to the Loss-ALAE data, since in section 9.5.2
we found insufficient evidence for asymptotic independence. However, multivariate
extreme value methods will come into play in section 10.4.6 again when we analyse
the extremes of certain Markov processes. There we will illustrate some parametric
techniques for asymptotic independence with suitable adaptations to Markov
processes as in Bortot and Tawn (1998).


Non-parametric inference. Combining (9.106) and (9.104), we find that the
distribution of $(Z_1, Z_2)$ satisfies the following scaling relation: For a Borel set $B$ in
$(t, \infty)^2$ with $t$ large and for $0 < s < \infty$, we have by successive applications of
(9.105),

$$\begin{aligned}
P[(Z_1, Z_2) \in B] &\approx P[Z_1 > t,\ Z_2 > t]\, \Lambda(t^{-1} B) \\
&= P[Z_1 > t,\ Z_2 > t]\, s^{1/\eta}\, \Lambda(s t^{-1} B) \\
&\approx s^{1/\eta}\, P[(Z_1, Z_2) \in sB].
\end{aligned} \tag{9.119}$$

[For the approximations to work, $B$ needs to be a continuity set of $\Lambda$, that is,
$\Lambda(\partial B) = 0$, with $\partial B$ the topological boundary of $B$.] Hence, if the set $B$ does not
contain any or only very few of the observations $(Z_{i1}, Z_{i2})$, $i = 1, \ldots, n$, we can
still estimate the probability $p = P[(Z_1, Z_2) \in B]$ by

$$\hat{p} = s^{1/\hat{\eta}}\, \frac{1}{n} \sum_{i=1}^{n} 1\{(Z_{i1}, Z_{i2}) \in sB\}.$$

Here $\hat{\eta}$ is an estimator of the coefficient of tail dependence as in section 9.5.2,
and $0 < s < 1$ is a scaling factor to be chosen such as to meet the two conflicting
criteria of sufficiently scaling down the failure set $B$ (small $s$) and keeping the
approximation (9.119) sufficiently accurate (large $s$). The asymptotic properties of
$\hat{p}$ are discussed in Draisma et al. (2004).
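
The scaled estimator $\hat{p}$ can be sketched as follows. The failure set is passed as an indicator function, which is an illustrative design choice and not part of the source; the function and argument names are hypothetical. The code uses the equivalence $(Z_{i1}, Z_{i2}) \in sB \iff s^{-1}(Z_{i1}, Z_{i2}) \in B$.

```python
import numpy as np

def scaled_tail_probability(z, in_failure_set, s, eta_hat):
    """Estimate P[(Z1, Z2) in B] by s^(1/eta_hat) times the fraction of points in sB.

    z              : array of shape (n, 2) on the standardized (Pareto or Frechet) scale
    in_failure_set : callable (z1, z2) -> bool, the indicator of the failure set B
    s              : scaling factor, 0 < s < 1
    eta_hat        : estimate of the coefficient of tail dependence
    """
    z = np.asarray(z, dtype=float)
    # A point lies in sB exactly when the point divided by s lies in B.
    hits = np.array([in_failure_set(z1 / s, z2 / s) for z1, z2 in z])
    return s ** (1.0 / eta_hat) * hits.mean()

# Toy data and failure set B = {z1 > 8, z2 > 8}, for illustration only.
toy = [[1.0, 1.0], [2.0, 3.0], [5.0, 6.0], [10.0, 12.0]]
p_hat = scaled_tail_probability(toy, lambda a, b: a > 8.0 and b > 8.0, s=0.5, eta_hat=0.5)
```

In the toy example two of the four points fall in $sB = \{z_1 > 4, z_2 > 4\}$, so $\hat{p} = 0.5^{2} \times 0.5 = 0.125$.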

Alternatively, we may estimate the tail of $(Z_1, Z_2)$ by first estimating the
exponent measure $\Lambda$ and secondly using the approximation (9.105). A naive estimator
for the exponent measure $\Lambda$ arises from replacing probabilities by empirical counts
in that same approximation,

$$\hat{\Lambda}(\cdot) = \frac{\sum_{i=1}^{n} 1\{t^{-1}(Z_{i1}, Z_{i2}) \in \cdot\,\}}{\sum_{i=1}^{n} 1\{\min(Z_{i1}, Z_{i2}) > t\}}. \tag{9.120}$$


Here, $t$ acts as a threshold, the choice of which should strike a balance between a
close approximation in (9.104) and a sufficient number of observations in the region
$(t, \infty)^2$. The estimator $\hat{\Lambda}(\cdot)$ does not satisfy the homogeneity property (9.106) and
is therefore not directly suited to approximate the tail of $(Z_1, Z_2)$ through (9.105).
However, we can turn $\hat{\Lambda}$ into an estimator for the spectral measure,

$$\begin{aligned}
\hat{H}_\Lambda(\cdot) &= \hat{\Lambda}\bigl(\{(z_1, z_2) \in (0, \infty)^2 : z_1 + z_2 > 1,\ z_1/(z_1 + z_2) \in \cdot\,\}\bigr) \\
&= \frac{\sum_{i=1}^{n} 1\{Z_{i1} + Z_{i2} > t,\ Z_{i1}/(Z_{i1} + Z_{i2}) \in \cdot\,\}}{\sum_{i=1}^{n} 1\{\min(Z_{i1}, Z_{i2}) > t\}}.
\end{aligned}$$

Observe that replacing probabilities by empirical counts in (9.114) leads to $\hat{H}_\Lambda$ as
well. Now, combine this $\hat{H}_\Lambda$ with an estimate $\hat{\eta}$ of the coefficient of tail dependence
to find an estimator of $\Lambda$ that does satisfy the required homogeneity property:


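$$\begin{aligned}
\tilde{\Lambda}\{(z_1, \infty) \times (z_2, \infty)\}
&= \int_{(0,1)} \Bigl\{\min\Bigl(\frac{w}{z_1}, \frac{1 - w}{z_2}\Bigr)\Bigr\}^{1/\hat{\eta}} \hat{H}_\Lambda(dw) \\
&= \frac{\sum_{i=1}^{n} 1(Z_{i1} + Z_{i2} > t)\, (Z_{i1} + Z_{i2})^{-1/\hat{\eta}}\, \{\min(Z_{i1}/z_1,\ Z_{i2}/z_2)\}^{1/\hat{\eta}}}{\sum_{i=1}^{n} 1\{\min(Z_{i1}, Z_{i2}) > t\}}.
\end{aligned}$$

The finite-sample or asymptotic properties of this estimator remain to be
investigated.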
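
The homogeneous estimator of $\Lambda\{(z_1, \infty) \times (z_2, \infty)\}$ built from the empirical spectral measure can be sketched directly from its closed form. The function name is hypothetical, and the threshold $t$ and the estimate $\hat{\eta}$ are taken as given inputs.

```python
import numpy as np

def lambda_homogeneous(z, z1, z2, t, eta_hat):
    """Homogeneous estimate of Lambda{(z1, inf) x (z2, inf)} from data z of shape (n, 2)."""
    z = np.asarray(z, dtype=float)
    radius = z[:, 0] + z[:, 1]
    # Denominator: number of points with both coordinates above the threshold t.
    n_joint = np.sum(np.minimum(z[:, 0], z[:, 1]) > t)
    if n_joint == 0:
        raise ValueError("no observations in (t, inf)^2; lower the threshold t")
    # Numerator: points with Z_i1 + Z_i2 > t, each contributing
    # {min(Z_i1 / z1, Z_i2 / z2) / (Z_i1 + Z_i2)}^(1 / eta_hat).
    keep = radius > t
    ratios = np.minimum(z[keep, 0] / z1, z[keep, 1] / z2) / radius[keep]
    return np.sum(ratios ** (1.0 / eta_hat)) / n_joint
```

By construction the estimate satisfies the scaling property (9.106) exactly: multiplying $(z_1, z_2)$ by $s$ multiplies the value by $s^{-1/\hat{\eta}}$.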


Parametric inference. Another possibility to perform statistical inference on the
joint tail model (9.99) is within a parametric sub-model, analytically tractable but still
sufficiently flexible. As for models based on multivariate extreme value distributions,
inference can be done by the censored-likelihood approach, see section 9.4.2. The
case for asymptotic dependence or asymptotic independence is not always clear-cut,
however, making it useful to quantify the uncertainty in a Bayesian set-up through the
posterior distribution on the parameters of a model that allows for both asymptotic
dependence and independence (Coles and Pauli 2002).

It is not completely trivial to construct useful parametric models for the joint
tail model (9.99). The modelling strategy adopted in Ledford and Tawn (1997)
and Bruun and Tawn (1998) is to take a specification of the form
$\mathcal{L}(z_1, z_2) = \mathcal{L}_*\{z_1/(z_1 + z_2)\}$, where $\mathcal{L}_*(w) = K g_*(w)$ for a positive constant $K$ and a
quasi-symmetric ray dependence function $g_*$, see section 10.4.6. As a model selection
diagnostic, one can first estimate $\Lambda$ by the non-parametric estimator $\hat{\Lambda}$ given in
(9.120) and then identify $g_*$ and $(c_1, c_2)$ by evaluating (9.103) at $(w, 1 - w)$ and
$(1 - w, w)$ and assuming that $g_*$ is symmetric around 1/2. It is not always obvious
that a certain parametric form for $g_*$ leads to a valid distribution; in particular, the
spectral density in (9.113) should be checked to be non-negative over its whole
range. A more natural approach, therefore, is to specify a parametric form for the
spectral density itself (Ramos 2003).

9.6 Additional Topics 

Only some components are extreme 

Until now, we have only considered the tail function $1 - F(x)$ for $x$ such that all
marginal tail probabilities $1 - F_j(x_j)$ are small. In practice, however, we might
want to perform estimation and extrapolation in a region of the support of the
distribution where some, but not all, components are large. This, however, falls
outside the scope of both the traditional approach based on extreme value
distributions and the more recent approach of asymptotic independence in section 9.5.
All in all, there seems to be a huge gap in the theory and practice of multivariate
extremes in dire need of being filled in.

This need was already recognized in Maulik et al. (2002) in the analysis of 
certain internet traffic data. The size of a transmitted file is equal to the product 
of the transmission rate and the transmission time. The distributions of both the 
transmission rate and the transmission time are heavy-tailed, the first one having 
the heavier tail, and their joint distribution is asymptotically independent. However, 
this information is insufficient to characterize the tail of the distribution of the file 
length. To tackle the problem, the authors develop a more refined model, implicitly 
assuming a limit distribution for one variable given that the other one is large. 

Heffernan and Tawn (2004) develop a comparable approach in a general
$d$-variate setting. In Gumbel coordinates, they assume that conditionally on one
variable being extreme, the distribution of the $d - 1$ other variables, properly
centred and scaled, converges to a limit. Inductively proceeding from a number
of analytical examples, they propose a parametric model for the normalizing
constants. In this way, they end up with a multivariate semi-parametric regression
model, which they then apply to five-dimensional air quality monitoring data
recorded at the city centre of Leeds, UK.

Spatial extremes 

In most studies of multivariate extremes, the number of variables is small, often just 
two. Many environmental phenomena, however, have a spatial dimension, and the 
aim is then to model the spatial dependence within extreme events in continuous 
space based on observations recorded at a grid. The basic modelling tool is formed 
by so-called max-stable processes (de Haan 1984), stochastic processes of which 
all finite-dimensional distributions are multivariate extreme value distributions. 

In the context of coastal flood prevention, Coles and Tawn (1990) consider sea- 
level annual maxima along the British coast, assuming a bivariate logistic model 
for neighbouring sites. Coles (1993) constructs models for the spatial dependence 
of daily rainfall amounts recorded at 11 sites in the south-west of England; see 
also Coles (1994). The same data are considered again in Coles and Tawn (1996a), 
who extend the analysis to the aggregated rainfall over the whole region, and 
in Schlather and Tawn (2003), who construct non-parametric estimators for the 
extremal dependence between sites as a function of the inter-site distance. The issue 
of asymptotic dependence versus asymptotic independence in spatial processes is 
explored in Ancona-Navarrete and Tawn (2002). 

A new impetus is the work by de Haan and Lin (2001). They develop an extension
of the classical multivariate extreme value theory as developed in Chapter 8
to component-wise maxima of independent, identically distributed stochastic
processes of a continuous variable. Within the same framework, Einmahl and Lin
(2003) treat the simultaneous estimation of the tails of the marginal distributions.

9.7 Summary 

Analysing multivariate extremes involves a number of choices: parametric models
or not, block maxima or multivariate-threshold exceedances, asymptotic dependence
or asymptotic independence, just to mention the most important ones.
Unfortunately, the current state of the art does not seem developed far enough to provide
the user with a fully automatic, universally applicable methodology. Rather than
that, intelligent judgement of the user is, and probably will always be, necessary.
To assist the reader in making wise decisions, we provide here an overview of
all the methods, together with their drawbacks and benefits. We also sketch some
avenues for further research.

Statistical inference on the class of multivariate extreme value distributions
is hampered by the lack of a finite-dimensional parametrization for the dependence
structure. A natural option then is to construct parametric sub-families
that are, on the one hand, sufficiently flexible to satisfactorily approximate any
given member from the general class and, on the other hand, still analytically and
computationally tractable. The benefits of parametric modelling are easier statistical
inference through likelihood machinery, joint estimation of marginal and dependence
parameters, the possibility to tackle complex high-dimensional problems
through careful model building and natural ways to include covariate information.
The main drawback, of course, is the inherent risk of model mis-specification.

As in the univariate case, the historically first and conceptually simplest 
multivariate extreme value method is based on block maxima. Observations 
are partitioned into blocks, each block is reduced to the vector of component-wise 
maxima, and the collection of block maxima is modelled as an independent 
sample from a common multivariate extreme value distribution. In the bivariate 
case, there is a range of direct non-parametric estimators for the Pickands 
dependence function to choose from. The efficiency of these estimators is still 
an open issue. The alternative consists of postulating a parametric model for the 
spectral measure and estimating the parameters by maximum likelihood, possibly 
jointly with the marginal parameters. A common criticism of block maxima 
methods, univariate or multivariate, is that they throw away many relevant 
observations. 

More efficient are so-called threshold methods, as these employ all observations 
for which at least one coordinate exceeds a corresponding high threshold. 
The aim is now to estimate a distribution in a region where there are almost no 
observations. The starting point is the approximate relation 

$$F(x) \approx \exp[-l\{-\log F_1(x_1), \ldots, -\log F_d(x_d)\}] \qquad (9.121)$$

where $F$ is a distribution function in the domain of attraction of an extreme value 
distribution with stable tail dependence function $l$, and where $x$ is such that 
$1 - F_j(x_j)$ is small for all $j = 1, \ldots, d$. The task can therefore be split into 
estimating the marginal tails and estimating the stable tail dependence function. 

Non-parametric methods are essentially based on the tail empirical dependence 
function, which arises if we take the empirical version of the related approximation 
$F(x) \approx 1 - l\{1 - F_1(x_1), \ldots, 1 - F_d(x_d)\}$. The tail empirical dependence 
function forms the starting point for fully non-parametric estimators for the spectral 
measure and the Pickands dependence function. We conjecture that existing 
non-parametric methods can still be improved if they take as a starting point the 
more accurate approximation (9.121). 

Alternatively, the above approximations can be turned into fully parametric 
models for $F$ by assuming a parametric model for $l$ and by modelling the marginal 
tails by GP or GEV distributions. Marginal and dependence parameters can now 
be estimated jointly by maximum likelihood, benefits being transfer of information 
between coordinates, correct assessment of the global estimation uncertainty, 
possibility to exploit common features in the different marginal tails, and natural 
extensions to include covariate information. Some care needs to be taken, however, 
in the construction of the likelihood, as the model only specifies the form of 
$F$ in a certain region of its support. Two possible ways to deal with this are the 
so-called point-process method and the censored-likelihood method. We prefer the 
latter as it yields the more accurate estimates of dependence parameters in case the 
dependence structure is close to asymptotic independence. 

Methods derived from multivariate extreme value distributions only allow for 
models in which the components at extreme levels are either exactly independent 
or asymptotically dependent in the sense that joint extremes occur with a probability 
of the same order of magnitude as a single extreme. This is unsatisfactory for 
asymptotically independent distributions for which the components are still positively 
or negatively associated at extreme levels, as quantified by the coefficient of 
tail dependence. The deficiency is overcome by a model for the bivariate survivor 
function that bridges the gap between exact independence and asymptotic independence. 
The merits of the few available parametric and non-parametric inference 
techniques remain to be assessed. 

A common feature of all models for multivariate extremes is that they describe 
the distribution only in that part of its support where all coordinates are extreme. 
The case where only some coordinates are extreme is relatively unexplored. 



10 

EXTREMES OF STATIONARY 
TIME SERIES 


co-authored by Chris Ferro 


10.1 Introduction 


The extremes of time series can be very different from those of independent sequences. 
Serial dependence affects not only the magnitude of extremes but also their 
qualitative behaviour. This necessitates both a modification of standard methods for 
analysing extremes and a development of additional tools for describing these new 
features. In this chapter, we present mathematical characterizations for the extremes 
of stationary processes and statistical methods for their estimation. 

The effect of serial dependence on extremes can be illustrated with a simple 
example. The moving-maximum process (Deheuvels 1983) is defined by 

$$X_i = \max_{j \ge 0} a_j Z_{i-j}, \quad i \in \mathbb{Z}, \qquad (10.1)$$

where the coefficients $a_j \ge 0$ satisfy $\sum_{j \ge 0} a_j = 1$ and the $Z_i$ are 
independent, standard Fréchet random variables, that is, $P[Z \le x] = \exp(-1/x)$ for 
$0 < x < \infty$; the marginal distribution of $X_i$ is also standard Fréchet. A partial 
realization of the process when $a_0 = a_1 = 1/2$ (Newell 1964) is reproduced 
in Figure 10.1. The serial dependence causes large values to occur in pairs; more 
general clustering is possible with other choices for the coefficients. This affects 
the distribution of order statistics (for example, the two largest order statistics 
have the same asymptotic distribution), while the presence of clusters of extremes 
is a phenomenon that is not experienced for independent sequences. 

Figure 10.1 A partial realization of the moving-maximum process 
$X_i = \max(Z_i, Z_{i-1})/2$. 
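
The pairing of large values can be reproduced with a few lines of code. The following is a minimal sketch, not the code used for Figure 10.1: the function names `standard_frechet` and `moving_maximum`, the seed and the series length are illustrative assumptions, and the process is truncated to finitely many coefficients.

```python
import random


def standard_frechet(rng):
    """Draw one standard Frechet variate, P[Z <= x] = exp(-1/x).

    If E is unit exponential then 1/E is standard Frechet.
    """
    e = 0.0
    while e == 0.0:                 # guard against a zero exponential draw
        e = rng.expovariate(1.0)
    return 1.0 / e


def moving_maximum(z, a):
    """Return X_i = max_j a[j] * z[i - j] for every i with a full window.

    `a` holds the finitely many non-zero coefficients a_0, ..., a_{q-1}
    (an assumption of this sketch; (10.1) allows infinitely many).
    """
    q = len(a)
    return [max(a[j] * z[i - j] for j in range(q)) for i in range(q - 1, len(z))]


if __name__ == "__main__":
    rng = random.Random(1)          # arbitrary seed, for reproducibility only
    z = [standard_frechet(rng) for _ in range(201)]
    x = moving_maximum(z, [0.5, 0.5])
    top = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:2]
    print("two largest values:", [x[i] for i in top], "at positions", sorted(top))
```

When the largest innovation does not fall at either end of the simulated stretch, the two largest values of the series coincide and sit at adjacent positions, which is the pairing described above.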

Extreme events in the physical world are often synonymous with clusters of 
large values: for example, a flood might be caused by several days with heavy rain¬ 
fall. A single extreme event such as a flood can impact the environment, man-made 
structures and public health and generate a spate of insurance claims. It is therefore 
of great interest to know the rate at which such events can be expected to occur 
and what they might look like when they do. 

There are two approaches to analysing the extremes of time series. One is to 
choose a time-series model for the complete process, fit it to the data and then 
determine its extremal behaviour either analytically or by simulation. This topic 
has been well treated elsewhere, by Embrechts et al. (1997), for instance, and we 
shall touch on it only briefly in section 10.6. The second approach is to choose 
a model for the process at extreme levels only and fit it to the extremes in the 
data. This alternative is attractive because, as we have seen elsewhere in this book, 
models for extremes can be derived under very weak conditions on the process. It 
is on this approach that we shall concentrate. 

We begin in section 10.2 by considering the sample maximum, which can be 
modelled, as for independent sequences, with the generalized extreme value (GEV) 
distribution. In section 10.3, we achieve a characterization for all exceedances over 
a high threshold, which supplies a point-process model for clusters of extremes. 
Models for the extremes of Markov processes are established in section 10.4. Up 
to this point, we shall deal with only stationary (in the strict sense) univariate 
sequences, for which both theory and methods are well developed. In section 10.5, 
we summarize some key results for the extremes of multivariate processes. Finally, 
in section 10.6, we provide the reader with some key references about additional 
topics that, despite their importance, did not make it to the core of the chapter. 

Many of the statistical methods are illustrated for a series of daily maximum 
temperatures recorded at Uccle, Belgium; see also section 1.3.2. The 
data analysis was performed using R (version 1.6.1), freely available from 
www.r-project.org/. The code can be downloaded from Chris Ferro's homepage 
on www.met.rdg.ac.uk/~sws02caf/software.html/ and from 
www.UCS_Software.be. [JS: URL to be completed.] 

10.2 The Sample Maximum 

Let $X_1, X_2, \ldots$ be a (strictly) stationary sequence of random variables with 
marginal distribution function $F$. The assumption entails that, for all integers $h \ge 0$ and 
$n \ge 1$, the distribution of the random vector $(X_{h+1}, \ldots, X_{h+n})$ does not depend 
on $h$. For the maximum $M_n = \max_{i=1,\ldots,n} X_i$, we seek the limiting distribution 
of $(M_n - b_n)/a_n$ for some choice of normalizing constants $a_n > 0$ and $b_n$. In 
Chapter 2, it was shown that for independent random variables, the only possible 
non-degenerate limits are the extreme value distributions. We shall see in 
section 10.2.1 that this remains true for stationary sequences if long-range dependence 
at extreme levels is suitably restricted. However, the limit distribution need 
not be the same as for the maximum $\tilde M_n = \max_{i=1,\ldots,n} \tilde X_i$ of the associated, 
independent sequence $\{\tilde X_i\}$ with the same marginal distribution as $\{X_i\}$. The distinction 
is due to the extremal index, introduced in section 10.2.3, which measures the 
tendency of extreme values to occur in clusters. 


10.2.1 The extremal limit theorem 

For a set $J$ of positive integers, let $M(J) = \max_{i \in J} X_i$. For convenience, also 
set $M(\emptyset) = -\infty$. We shall partition the integers $\{1, \ldots, n\}$ into disjoint blocks 
$J_j = J_{j,n}$ and show that the block maxima $M(J_j)$ are asymptotically independent. 
Since $M_n = \max_j M(J_j)$, it follows as in Chapter 2 that the limit distribution of 
$(M_n - b_n)/a_n$, if it exists, must be an extreme value distribution. 

Let $(r_n)_n$ be a sequence of positive integers such that $r_n = o(n)$ as $n \to \infty$, 
and put $k_n = \lfloor n/r_n \rfloor$. Partition $\{1, \ldots, n\}$ into $k_n$ blocks of size $r_n$, 

$$J_j = J_{j,n} = \{(j-1) r_n + 1, \ldots, j r_n\}, \quad j = 1, \ldots, k_n, \qquad (10.2)$$

and, in case $k_n r_n < n$, a remainder block, $J_{k_n+1} = \{k_n r_n + 1, \ldots, n\}$. Now define 
thresholds $u_n$ increasing at a rate for which the expected number of exceedances 
over $u_n$ remains bounded: $\limsup n \bar F(u_n) < \infty$, with of course $\bar F = 1 - F$. We 
shall see that, under an appropriate condition, 

$$P[M_n \le u_n] = \prod_{j=1}^{k_n} P[M(J_j) \le u_n] + o(1) = (P[M_{r_n} \le u_n])^{k_n} + o(1), \quad n \to \infty. \qquad (10.3)$$

This is precisely the desired representation of $M_n$ in terms of independent random 
variables, $M_{r_n}$. 
To find out when (10.3) holds, observe that 

$$P[M_n \le u_n] = P\Big[\bigcap_{j=1}^{k_n+1} \{M(J_j) \le u_n\}\Big].$$

Since $P[M(J_{k_n+1}) > u_n] \le r_n \bar F(u_n) \to 0$, the remainder block can be omitted: 

$$P[M_n \le u_n] = P\Big[\bigcap_{j=1}^{k_n} \{M(J_j) \le u_n\}\Big] + o(1), \quad n \to \infty.$$

A crucial point is that the events $\{X_i > u_n\}$ are sufficiently rare for the probability 
of an exceedance occurring near the ends of the blocks $J_j$ to be negligible. Let 
$(s_n)_n$ be a sequence of positive integers such that $s_n = o(r_n)$ as $n \to \infty$, and let 
$J'_j = J'_{j,n} = \{j r_n - s_n + 1, \ldots, j r_n\}$ be the sub-block of size $s_n$ at the end of $J_j$. 
The sub-blocks are asymptotically unimportant, as 

$$P\Big[\bigcup_{j=1}^{k_n} \{M(J'_j) > u_n\}\Big] \le k_n s_n \bar F(u_n) \to 0, \quad n \to \infty.$$


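This leaves us with 

$$P[M_n \le u_n] = P\Big[\bigcap_{j=1}^{k_n} \{M(J^*_j) \le u_n\}\Big] + o(1), \quad n \to \infty,$$

where the $J^*_j = \{(j-1) r_n + 1, \ldots, j r_n - s_n\}$ are separated from one another by 
a distance $s_n$. If the events $\{M(J^*_j) \le u_n\}$ are approximately independent then we 
would obtain, as required, 

$$P[M_n \le u_n] = \prod_{j=1}^{k_n} P[M(J^*_j) \le u_n] + o(1) = \prod_{j=1}^{k_n} P[M(J_j) \le u_n] + o(1), \quad n \to \infty,$$

using again that the probability of an exceedance in one of the sub-blocks of size $s_n$ at 
the ends of the $J_j$ is at most $k_n s_n \bar F(u_n) \to 0$ as $n \to \infty$. 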

A mixing condition known as the $D(u_n)$ condition (Leadbetter 1974) suffices 
for the events $\{M(J^*_j) \le u_n\}$ to become approximately independent as $n$ 
increases. Let 

$$\mathcal{I}_{j,k}(u_n) = \Big\{ \bigcap_{i \in I} \{X_i \le u_n\} : I \subseteq \{j, \ldots, k\} \Big\}$$

be the set of all intersections of the events $\{X_i \le u_n\}$, $j \le i \le k$. 
Condition 10.1 $D(u_n)$. For all $A_1 \in \mathcal{I}_{1,l}(u_n)$, $A_2 \in \mathcal{I}_{l+s,n}(u_n)$ and $1 \le l \le n - s$, 

$$|P(A_1 \cap A_2) - P(A_1) P(A_2)| \le \alpha(n, s)$$

and $\alpha(n, s_n) \to 0$ as $n \to \infty$ for some positive integer sequence $s_n$ such 
that $s_n = o(n)$. 


The $D(u_n)$ condition says that any two events of the form $\{M(I_1) \le u_n\}$ and 
$\{M(I_2) \le u_n\}$ can become approximately independent as $n$ increases when the 
index sets $I_i \subset \{1, \ldots, n\}$ are separated by a relatively short distance $s_n = o(n)$. 
Hence, the $D(u_n)$ condition limits the long-range dependence between such events. 

Now if the events $A_1, \ldots, A_k \in \mathcal{I}_{1,n}(u_n)$ are such that the corresponding index 
sets are separated from each other by a distance $s$, then, by induction on $k$, we get 

$$\Big| P\Big(\bigcap_{j=1}^{k} A_j\Big) - \prod_{j=1}^{k} P(A_j) \Big| \le k \alpha(n, s).$$

Therefore, if $s_n = o(r_n)$ and $k_n \alpha(n, s_n) \to 0$, then 

$$\Big| P\Big[\bigcap_{j=1}^{k_n} \{M(J^*_j) \le u_n\}\Big] - \prod_{j=1}^{k_n} P[M(J^*_j) \le u_n] \Big| \le k_n \alpha(n, s_n) \to 0,$$

as $n \to \infty$. When $\alpha(n, s_n) \to 0$ for some $s_n = o(n)$, it is indeed possible to find 
$r_n = o(n)$ such that $s_n = o(r_n)$ and $k_n \alpha(n, s_n) \to 0$; take, for instance, $r_n$ to be 
the integer part of $[n \max\{s_n, n \alpha(n, s_n)\}]^{1/2}$. Together, we obtain the following 
fundamental result. 


Theorem 10.2 (Leadbetter 1974) Let $\{X_n\}$ be a stationary sequence for which 
there exist sequences of constants $a_n > 0$ and $b_n$ and a non-degenerate distribution 
function $G$ such that 

$$P\Big[\frac{M_n - b_n}{a_n} \le x\Big] \to G(x), \quad n \to \infty.$$

If $D(u_n)$ holds with $u_n = a_n x + b_n$ for each $x$ such that $G(x) > 0$, then $G$ is an 
extreme value distribution function. 

Note that the $D(u_n)$ condition is required to hold for all sequences $u_n = a_n x + b_n$ 
for which $G(x) > 0$. The necessity of this requirement is shown by the process 
$X_i \equiv X_1$, for which $D(u_n)$ holds as soon as $F(u_n) \to 1$ as $n \to \infty$. Nevertheless, 
the condition is weak as it concerns events of the form $\{X_i \le u_n\}$ only. Compare 
this with strong mixing (Loynes 1965), for example, which requires Condition 10.1 
to hold for classes of sets $\mathcal{F}_{j,k} = \sigma(X_i : j \le i \le k)$, the $\sigma$-algebra generated by the 
random variables $X_j, \ldots, X_k$. For Gaussian sequences with auto-correlation $\rho_n$ at 
lag $n$, the $D(u_n)$ condition is satisfied as soon as $\rho_n \log n \to 0$ as $n \to \infty$ (Berman 
1964). This is much weaker than the geometric decay assumed by auto-regressive 
models, for example. 

In fact, Theorem 10.2 holds true for even weaker versions of the $D(u_n)$ condition, 
as may be evident from our discussion. One example (O'Brien 1987) is 
asymptotic independence of maxima (AIM), which requires Condition 10.1 to 
hold when 

$$\mathcal{I}_{j,k}(u_n) = \big\{ \{M(I) \le u_n\} : I = \{i_1, \ldots, i_2\} \subseteq \{j, \ldots, k\} \big\},$$

comprising block maxima over intervals of integers rather than arbitrary sets of 
integers. This weakening admits a class of periodic Markov chains. 


Example 10.3 The max-autoregressive process of order one, or ARMAX in short, 
is defined by the recursion 

$$X_i = \max\{\alpha X_{i-1}, (1 - \alpha) Z_i\}, \quad i \in \mathbb{Z}, \qquad (10.4)$$

where $0 \le \alpha < 1$ and where the $Z_i$ are independent standard Fréchet random variables. 
A stationary solution of the recursion is 

$$X_i = \max_{j \ge 0} \alpha^j (1 - \alpha) Z_{i-j}, \quad i \in \mathbb{Z},$$

showing that the ARMAX process is a special case of the moving-maximum process 
of the introduction; in particular, the marginal distribution of the process is 
standard Fréchet. Furthermore, the $D(u_n)$ condition can be shown to hold for general 
moving-maximum processes, so we expect the limit distribution of $M_n/n$ to 
be an extreme value distribution. Indeed, for $0 < x < \infty$, we have 

$$P[M_n \le x] = P[X_1 \le x, (1 - \alpha) Z_2 \le x, \ldots, (1 - \alpha) Z_n \le x]$$
$$= P[X_1 \le x] \{P[(1 - \alpha) Z_1 \le x]\}^{n-1}$$
$$= \exp[-\{1 + (1 - \alpha)(n - 1)\}/x] \qquad (10.5)$$

so that 

$$P[M_n/n \le x] \to \exp\{-(1 - \alpha)/x\} =: G(x), \quad n \to \infty.$$

Compare this with the limit distribution $\tilde G(x) = \exp(-1/x)$ of $\tilde M_n/n$. We shall 
discover in section 10.2.3 that the relationship $G(x) = \tilde G(x)^{1-\alpha}$ is no coincidence. 
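
The convergence in Example 10.3 is easy to check numerically. The sketch below evaluates the exact probability (10.5) at $x \mapsto nx$ and compares it with the limit $\exp\{-(1-\alpha)/x\}$; it also includes a simulator of recursion (10.4). The function names and the values $\alpha = 0.5$, $x = 2$ are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def armax_max_cdf(n, x, alpha):
    """Exact P[M_n <= x] for the stationary ARMAX process, equation (10.5)."""
    return math.exp(-(1.0 + (1.0 - alpha) * (n - 1)) / x)


def simulate_armax(n, alpha, rng):
    """Simulate X_1, ..., X_n from recursion (10.4), started in stationarity."""
    def frechet():
        e = 0.0
        while e == 0.0:             # guard against a zero exponential draw
            e = rng.expovariate(1.0)
        return 1.0 / e              # 1/Exp(1) is standard Frechet

    x = [frechet()]                 # X_1 has the standard Frechet marginal
    for _ in range(n - 1):
        x.append(max(alpha * x[-1], (1.0 - alpha) * frechet()))
    return x


if __name__ == "__main__":
    alpha, x = 0.5, 2.0             # illustrative values
    for n in (10, 100, 10_000):
        print(n, armax_max_cdf(n, n * x, alpha))
    print("limit", math.exp(-(1.0 - alpha) / x))
```

The simulator can be used to confirm the exact formula by Monte Carlo, for instance by comparing the proportion of simulated paths with `max(path) <= n * x` to `armax_max_cdf(n, n * x, alpha)`.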


If Theorem 10.2 holds, then we can fit the GEV distribution to block maxima 
from stationary sequences. For large $n$, we have 

$$P[M_n \le a_n x + b_n] \approx \exp\Big[-\Big\{1 + \gamma \Big(\frac{x - \mu_0}{\sigma_0}\Big)\Big\}^{-1/\gamma}\Big],$$

say. Therefore 

$$P[M_n \le x] \approx \exp\Big[-\Big\{1 + \gamma \Big(\frac{x - \mu}{\sigma}\Big)\Big\}^{-1/\gamma}\Big],$$

where the parameters $\mu = a_n \mu_0 + b_n$ and $\sigma = a_n \sigma_0$ assimilate the normalizing 
constants. The parameters $(\mu, \sigma, \gamma)$ can be estimated by maximum likelihood, for 
example, as in the following section. 

10.2.2 Data example 

The data plotted in Figure 10.2 are daily maximum temperatures recorded in 
degrees Celsius at Uccle, Belgium, during the years from 1901 to 1999. All days 
except those in July, which is generally the warmest month, have been removed 
in order to make our assumption of stationarity more reasonable. These data are 
freely available at www.knmi.nl/samenw/eca as part of the European Climate 
Assessment and Data set project (Klein Tank et al. 2002). 

We begin our analysis of these data by fitting the GEV distribution to the July 
maxima. The maximum-likelihood estimates of the parameters, with standard errors 
in brackets, are $\hat\mu = 30.0$ (0.3), $\hat\sigma = 3.0$ (0.2) and $\hat\gamma = -0.34$ (0.07). The diagnostic 
plots in Figure 10.3 indicate a systematic discrepancy due perhaps to measurement 
error or non-stationary meteorological conditions, but the most extreme 
maxima are modelled well. The estimate of the upper limit for the distribution 
of July maximum temperature obtained from the GEV fit is $\hat\mu - \hat\sigma/\hat\gamma = 38.7$ °C, 
with profile-likelihood 95% confidence interval (37.3, 43.9). The estimated 100, 
1000 and 10,000 July return levels are 36.9 (36.2, 38.6), 37.9 (36.9, 40.5) and 38.3 
(37.2, 41.8). We shall investigate other features of these data in later sections. 

Figure 10.2 Daily maximum temperatures in July at Uccle from 1901 to 1999. 

Figure 10.3 Quantile and return level plots for the generalized extreme value 
distribution fitted to July maximum temperatures. 
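
As a rough check on the figures quoted in this section, the upper endpoint and the return levels can be recomputed from the rounded parameter estimates. This is a sketch in Python rather than the R code referred to in the introduction; the function names are hypothetical, and because the inputs are rounded to the digits printed above, the results agree with the quoted values only to about one decimal place.

```python
import math


def gev_quantile(p, mu, sigma, gamma):
    """Quantile of the GEV distribution at probability p, for gamma != 0."""
    return mu + sigma * ((-math.log(p)) ** (-gamma) - 1.0) / gamma


def gev_upper_endpoint(mu, sigma, gamma):
    """Finite upper endpoint mu - sigma/gamma, valid only when gamma < 0."""
    if gamma >= 0:
        raise ValueError("the upper endpoint is finite only for gamma < 0")
    return mu - sigma / gamma


if __name__ == "__main__":
    mu, sigma, gamma = 30.0, 3.0, -0.34   # rounded estimates quoted in the text
    print("upper endpoint:", gev_upper_endpoint(mu, sigma, gamma))
    for m in (100, 1000, 10_000):
        # The m-July return level is the quantile at probability 1 - 1/m.
        print(m, "July return level:", gev_quantile(1.0 - 1.0 / m, mu, sigma, gamma))
```

The confidence intervals in the text come from profile likelihoods and cannot be reproduced from the point estimates alone.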

10.2.3 The extremal index 

Theorem 10.2 shows that the possible limiting distributions for maxima of stationary 
sequences satisfying the $D(u_n)$ condition are the same as those for maxima of 
independent sequences. Dependence can affect the limit distribution, however, as 
illustrated by Example 10.3. We investigate the issue further in this section. First 
note that approximation (10.3) is also true for independent sequences. The effect 
of dependence is therefore to be found in the distribution of block maxima, $M_{r_n}$. 

Choose thresholds $u_n$ such that $n \bar F(u_n) \to \tau$ for some $0 < \tau < \infty$. For the 
associated, independent sequence, 

$$P[\tilde M_n \le u_n] = \{F(u_n)\}^n = \Big\{1 - \frac{1}{n} n \bar F(u_n)\Big\}^n \to \exp(-\tau), \quad n \to \infty.$$

For a general stationary process, however, $P[M_n \le u_n]$ need not converge and, if 
it does, the limit need not be $\exp(-\tau)$. 

Suppose that $u_n$ and $v_n$ are two threshold sequences and that 

$$n \bar F(u_n) \to \tau, \quad P[M_n \le u_n] \to \exp(-\lambda),$$
$$n \bar F(v_n) \to \upsilon, \quad P[M_n \le v_n] \to \exp(-\psi),$$

as $n \to \infty$, where $\tau, \upsilon \in (0, \infty)$ and $\lambda, \psi \in [0, \infty)$. We show that if $D(u_n)$ holds, 
then $\lambda/\tau = \psi/\upsilon =: \theta$. In other words, $P[M_n \le u_n] \to \exp(-\theta\tau)$ and the effect 
of dependence is expressed by the scalar $\theta$, independently of $\tau$. 






Without loss of generality, assume that $\tau \ge \upsilon$ and define $n' = \lfloor (\upsilon/\tau) n \rfloor$. 
Clearly $n' \bar F(u_n) \to \upsilon$ so that 

$$|P[M_{n'} \le u_n] - P[M_{n'} \le v_{n'}]| \le n' |\bar F(u_n) - \bar F(v_{n'})| \to 0$$

and thus $P[M_{n'} \le u_n] \to \exp(-\psi)$ as $n \to \infty$. Now suppose as in section 10.2.1 
that $(r_n)_n$ and $(s_n)_n$ are positive integer sequences such that $r_n = o(n)$, $s_n = o(r_n)$, 
and $(n/r_n) \alpha(n, s_n) \to 0$ as $n \to \infty$. Since $n' \le n$, we have by (10.3) 

$$P[M_{n'} \le u_n] = (P[M_{r_n} \le u_n])^{n'/r_n} + o(1),$$
$$P[M_n \le u_n] = (P[M_{r_n} \le u_n])^{n/r_n} + o(1),$$

and thus 

$$\frac{n'}{r_n} P[M_{r_n} > u_n] \to \psi, \quad \frac{n}{r_n} P[M_{r_n} > u_n] \to \lambda,$$

as $n \to \infty$. Since $n' \sim (\upsilon/\tau) n$, we must have $\lambda/\tau = \psi/\upsilon$ as required, and 

$$\theta = \frac{\lambda}{\tau} = \lim_{n \to \infty} \frac{P[M_{r_n} > u_n]}{r_n \bar F(u_n)}. \qquad (10.6)$$

This argument is the basis for the following theorem. 


Theorem 10.4 (Leadbetter 1983) If there exist sequences of constants $a_n > 0$ and 
$b_n$ and a non-degenerate distribution function $\tilde G$ such that 

$$P\Big[\frac{\tilde M_n - b_n}{a_n} \le x\Big] \to \tilde G(x), \quad n \to \infty,$$

if $D(u_n)$ holds with $u_n = a_n x + b_n$ for each $x$ such that $\tilde G(x) > 0$ and if 
$P[(M_n - b_n)/a_n \le x]$ converges for some $x$, then 

$$P\Big[\frac{M_n - b_n}{a_n} \le x\Big] \to G(x) := \tilde G^{\theta}(x), \quad n \to \infty,$$

for some constant $\theta \in [0, 1]$. 


The constant $\theta$ is called the extremal index and, unless it is equal to one, the 
limiting distributions for the independent and stationary sequences are not the same. 
If $\theta > 0$, then $G$ is an extreme value distribution, but with different parameters 
than $\tilde G$. In particular, if $(\mu, \sigma, \gamma)$ are the parameters of $G$ and $(\tilde\mu, \tilde\sigma, \tilde\gamma)$ are the 
parameters of $\tilde G$, then their relationship is 

$$\gamma = \tilde\gamma, \quad \mu = \tilde\mu - \tilde\sigma \frac{1 - \theta^{\gamma}}{\gamma}, \quad \sigma = \tilde\sigma \theta^{\gamma}, \qquad (10.7)$$

or, if $\gamma = 0$, taking the limits, $\mu = \tilde\mu + \tilde\sigma \log\theta$ and $\sigma = \tilde\sigma$. Observe that the extreme 
value index $\gamma$ remains unaltered. 
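
Relationship (10.7) can be verified numerically by checking that the GEV distribution with the transformed parameters coincides with $\tilde G^{\theta}$. The following sketch uses hypothetical function names and arbitrary parameter values chosen only for illustration.

```python
import math


def gev_cdf(x, mu, sigma, gamma):
    """GEV distribution function; gamma = 0 is the Gumbel case."""
    if gamma == 0.0:
        return math.exp(-math.exp(-(x - mu) / sigma))
    t = 1.0 + gamma * (x - mu) / sigma
    if t <= 0.0:
        # Outside the support: below it when gamma > 0, above it when gamma < 0.
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-t ** (-1.0 / gamma))


def dependent_gev_params(mu_t, sigma_t, gamma_t, theta):
    """Map the parameters of G-tilde to those of G = G-tilde**theta, eq. (10.7)."""
    if gamma_t == 0.0:
        return mu_t + sigma_t * math.log(theta), sigma_t, 0.0
    sigma = sigma_t * theta ** gamma_t
    mu = mu_t - sigma_t * (1.0 - theta ** gamma_t) / gamma_t
    return mu, sigma, gamma_t


if __name__ == "__main__":
    mu_t, sigma_t, gamma_t, theta = 30.0, 3.0, -0.34, 0.5   # illustrative values
    params = dependent_gev_params(mu_t, sigma_t, gamma_t, theta)
    for x in (25.0, 30.0, 35.0):
        print(x, gev_cdf(x, *params), gev_cdf(x, mu_t, sigma_t, gamma_t) ** theta)
```

The two printed columns should agree up to floating-point error for every $x$ and every $0 < \theta \le 1$.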
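Example 10.5 The derivation in Example 10.3 shows that the extremal index of 
the ARMAX process is $\theta = 1 - \alpha$. More generally, for the moving-maximum 
process (10.1), we have 

$$P[M_n \le nx] = P\Big[\max_{j \ge 0}(a_j Z_{1-j}) \le nx, \ldots, \max_{j \ge 0}(a_j Z_{n-j}) \le nx\Big]$$
$$= P\Big[\max_{i \ge 0} \max_{1 \le j \le n} (a_{i+j} Z_{-i}) \le nx, \ \max_{1 \le i \le n} \max_{0 \le j \le n-i} (a_j Z_i) \le nx\Big]$$
$$= \exp\Big[-\frac{1}{nx} \Big\{ \sum_{i \ge 0} \max_{1 \le j \le n} a_{i+j} + \sum_{i=1}^{n} \max_{0 \le j \le n-i} a_j \Big\}\Big].$$

We treat both sums separately. The first sum can, for positive integer $m$, be 
bounded by 

$$\frac{1}{n} \sum_{i \ge 0} \max_{1 \le j \le n} a_{i+j} \le \frac{m}{n} + \frac{1}{n} \sum_{i \ge m} \max_{1 \le j \le n} a_{i+j} \le \frac{m}{n} + \sum_{i > m} a_i.$$

Letting first $n$ and then $m$ tend to infinity, we obtain that 
$n^{-1} \sum_{i \ge 0} \max_{1 \le j \le n} a_{i+j} \to 0$ as $n \to \infty$. For 
the second sum, let $a_{(1)} = \max_{j \ge 0} a_j$. Since $\max_{0 \le j \le i} a_j \to a_{(1)}$ as $i \to \infty$, we 
have $n^{-1} \sum_{i=0}^{n-1} \max_{0 \le j \le i} a_j \to a_{(1)}$ as $n \to \infty$. Together, we obtain $\theta = a_{(1)}$. 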
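
For a moving-maximum process with finitely many non-zero coefficients, the two sums in Example 10.5 can be evaluated exactly, which gives a quick check that $P[M_n \le nx]$ approaches $\exp(-a_{(1)}/x)$. The sketch below assumes a finite coefficient list; the function name `mm_max_cdf` and the coefficient values are illustrative.

```python
import math


def mm_max_cdf(n, x, a):
    """Exact P[M_n <= n*x] for a moving-maximum process whose only non-zero
    coefficients are a[0], ..., a[q-1] (a finite-order assumption)."""
    q = len(a)
    # Innovations Z_0, Z_{-1}, ...: Z_{-i} enters with coefficients a_{i+1..i+n}.
    past = sum(max(a[i + 1:i + n + 1], default=0.0) for i in range(q))
    # Innovations Z_1, ..., Z_n: Z_i enters with coefficients a_0, ..., a_{n-i}.
    present = sum(max(a[:n - i + 1]) for i in range(1, n + 1))
    return math.exp(-(past + present) / (n * x))


if __name__ == "__main__":
    a = [0.2, 0.5, 0.3]             # illustrative coefficients summing to one
    x = 1.0
    for n in (10, 100, 10_000):
        print(n, mm_max_cdf(n, x, a))
    print("limit", math.exp(-max(a) / x))
```

With $a_0 = a_1 = 1/2$ the two sums add up to $(n + 1)/2$, so that $P[M_n \le nx] = \exp\{-(n+1)/(2nx)\}$ and $\theta = 1/2$, in line with the pairs of large values seen for that process.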


Asymptotic independence 

The case $\theta = 1$ is true for independent processes, but it can be true for dependent 
processes too. The following condition (Leadbetter 1974) is sufficient when allied 
with $D(u_n)$. 

Condition 10.6 $D'(u_n)$. 

$$\lim_{k \to \infty} \limsup_{n \to \infty} \, n \sum_{j=2}^{\lfloor n/k \rfloor} P[X_1 > u_n, X_j > u_n] = 0.$$

To see the effect of $D'(u_n)$, apply the inclusion-exclusion formula to the event 
$\{M_{r_n} > u_n\} = \bigcup_{i=1}^{r_n} \{X_i > u_n\}$ to obtain 

$$\sum_{i=1}^{r_n} \bar F(u_n) \ge P[M_{r_n} > u_n] \ge \sum_{i=1}^{r_n} \bar F(u_n) - \sum_{1 \le i < j \le r_n} P[X_i > u_n, X_j > u_n].$$

Therefore, $P[M_{r_n} > u_n] \sim r_n \bar F(u_n)$ and $\theta = 1$ by (10.6) if 

$$\sum_{1 \le i < j \le r_n} P[X_i > u_n, X_j > u_n] = o\{r_n \bar F(u_n)\} = o(r_n/n)$$

as $n \to \infty$. Since, by stationarity, the sum is not greater than 
$r_n \sum_{j=2}^{r_n} P[X_1 > u_n, X_j > u_n]$, this 
is satisfied if $D'(u_n)$ holds. In contrast to the $D(u_n)$ condition, which controls 
the long-range dependence, the $D'(u_n)$ condition limits the amount of short-range 
dependence in the process at extreme levels. In particular, it postulates that the 
probability of observing more than one exceedance in a block is negligible. 

Example 10.7 When $\alpha = 0$, the ARMAX process (10.4) is independent and the 
$D'(u_n)$ condition holds. On the other hand, 

$$P[X_1 > u_n, X_2 > u_n] = 1 - P[X_1 \le u_n] - P[X_2 \le u_n] + P[X_1 \le u_n, X_2 \le u_n]$$
$$= 1 - 2\exp(-1/u_n) + P[X_1 \le u_n, (1 - \alpha) Z_2 \le u_n]$$
$$= 1 - 2\exp(-1/u_n) + \exp\{(\alpha - 2)/u_n\}$$

so that $n P[X_1 > u_n, X_2 > u_n] \to \alpha/x$ when $u_n = nx$ for some $0 < x < \infty$, that 
is, $D'(u_n)$ fails if $\alpha > 0$. 
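
The limit in Example 10.7 can be confirmed numerically. Because the leading terms of $1 - 2\exp(-1/u_n) + \exp\{(\alpha - 2)/u_n\}$ cancel, the sketch below rewrites the expression with `math.expm1` to avoid loss of precision for large thresholds; the function name and the parameter values are illustrative assumptions.

```python
import math


def armax_pair_exceedance(u, alpha):
    """P[X_1 > u, X_2 > u] for the ARMAX process, from Example 10.7.

    Uses 1 - 2 e^a + e^b = -2 (e^a - 1) + (e^b - 1), evaluated with expm1,
    which is numerically stable when u is large.
    """
    return -2.0 * math.expm1(-1.0 / u) + math.expm1((alpha - 2.0) / u)


if __name__ == "__main__":
    alpha, x = 0.5, 2.0             # illustrative values
    for n in (10**2, 10**4, 10**6):
        print(n, n * armax_pair_exceedance(n * x, alpha))
    print("limit alpha/x =", alpha / x)
```

For $\alpha = 0$ the same quantity tends to zero, consistent with the independent case.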


Positive extremal index 

The case $\theta = 0$ is pathological, although not impossible; see Denzel and O'Brien 
(1975) or Leadbetter et al. (1983), p. 71. It entails that sample maxima $M_n$ of the 
process are of smaller order than sample maxima $\tilde M_n$ of the associated independent 
sequence. Also, the expected number of exceedances in a block with at least one 
exceedance converges to infinity; see (10.10) below. For purposes of statistical 
inference, it will turn out to be convenient to assume that $0 < \theta \le 1$. A sufficient 
condition is that the influence of a large value $X_1 > u_n$ reaches only finitely far 
over time, as in Condition 10.8 below. For integers $0 \le j \le k$, we denote 
$M_{j,k} = \max\{X_{j+1}, \ldots, X_k\}$ (with $\max \emptyset = -\infty$) and $M_k = M_{0,k}$. 

Condition 10.8 The thresholds $u_n$ and the integers $r_n$ are such that $F(u_n) < 1$, 
$\bar F(u_n) \to 0$, $r_n \to \infty$ and 

$$\lim_{m \to \infty} \limsup_{n \to \infty} P[M_{m,r_n} > u_n \mid X_1 > u_n] = 0. \qquad (10.8)$$

n->oo 

For integer $m \ge 1$, by decomposing the event $\{M_{r_n} > u_n\}$ according to the time of 
the last exceedance, 

$$P[M_{r_n} > u_n] \ge \sum_{i=1}^{\lfloor r_n/m \rfloor} P[X_{(i-1)m+1} > u_n, M_{im,r_n} \le u_n] \ge \lfloor r_n/m \rfloor \bar F(u_n) P[M_{m,r_n} \le u_n \mid X_1 > u_n].$$

For large-enough $m$, therefore, Condition 10.8 guarantees that 

$$\liminf_{n \to \infty} \frac{P[M_{r_n} > u_n]}{r_n \bar F(u_n)} \ge \liminf_{n \to \infty} \frac{1}{m} P[M_{m,r_n} \le u_n \mid X_1 > u_n] > 0. \qquad (10.9)$$

Hence, if also $r_n = o(n)$ and $n \alpha(n, s_n) = o(r_n)$ for some $s_n = o(r_n)$, then $\theta$ must 
indeed be positive by (10.6). 

Blocks and runs 

The extremal index has several interpretations. For example, $\theta = \lim \theta^B_n(u_n)$, where 

$$\frac{1}{\theta^B_n(u_n)} = \frac{r_n \bar F(u_n)}{P[M_{r_n} > u_n]} = E\Big[\sum_{i=1}^{r_n} 1(X_i > u_n) \,\Big|\, M_{r_n} > u_n\Big] \qquad (10.10)$$

is the expected number of exceedances over $u_n$ in a block containing at least one 
such exceedance. Therefore, the extremal index is the reciprocal of the limiting 
mean number of exceedances in blocks with at least one exceedance. 

Another interpretation of the extremal index is due to O'Brien (1987). Assume 
again Condition 10.8 and let the integers $1 \le s_n \le r_n$ be such that $s_n \to \infty$ and 
$s_n = o(r_n)$ as $n \to \infty$; for instance, take $s_n$ the integer part of $r_n^{1/2}$. On the one 
hand, we have 

$$P[M_{r_n} > u_n] = \sum_{i=1}^{r_n} P[X_i > u_n, M_{i,r_n} \le u_n] \ge r_n \bar F(u_n) P[M_{1,r_n} \le u_n \mid X_1 > u_n]$$

and on the other hand, 

$$P[M_{r_n} > u_n] \le s_n \bar F(u_n) + (r_n - s_n) \bar F(u_n) P[M_{1,s_n} \le u_n \mid X_1 > u_n].$$

Moreover by (10.8) 

$$0 \le P[M_{1,s_n} \le u_n \mid X_1 > u_n] - P[M_{1,r_n} \le u_n \mid X_1 > u_n] \le P[M_{s_n,r_n} > u_n \mid X_1 > u_n] \to 0.$$

Writing 

$$\theta^R_n(u_n) = P[M_{1,r_n} \le u_n \mid X_1 > u_n] \qquad (10.11)$$

we see that the upper and lower bounds on $P[M_{r_n} > u_n]$ give 

$$\theta^B_n(u_n) = \theta^R_n(u_n) + o(1).$$

Therefore, $\theta = \lim \theta^R_n(u_n)$ represents the limiting probability that an exceedance 
is followed by a run of observations below the threshold. Both interpretations 
identify $\theta = 1$ with exceedances occurring singly in the limit, while $\theta < 1$ implies 
that exceedances tend to occur in clusters. Yet another interpretation of the extremal 
index, in terms of the times between exceedances over a high threshold, is given 
in section 10.3.4. 
Example 10.9 For the ARMAX process of Example 10.3, we can derive the 
extremal index $\theta = 1 - \alpha$ by combining (10.5) with the block (10.10) or run 
(10.11) definitions, where $u_n = nx$ for some $0 < x < \infty$ and $r_n$ is such that 
$r_n \to \infty$ but $r_n = o(n)$. Regarding the run definition (10.11), observe that by 
stationarity, 

$$\theta^R_n(u_n) = \{P[M_{r_n - 1} \le u_n] - P[M_{r_n} \le u_n]\}/\bar F(u_n).$$
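
The two finite-sample quantities of Example 10.9 can be evaluated in closed form from (10.5). The sketch below does so with `math.expm1` for numerical stability; the function names, and the choices $\alpha = 0.5$, $x = 2$ and $r_n = \lfloor n^{1/2} \rfloor$, are illustrative assumptions.

```python
import math


def armax_block_theta(r, u, alpha):
    """theta^B = P[M_r > u] / (r * Fbar(u)), combining (10.5) with (10.10)."""
    c_r = 1.0 + (1.0 - alpha) * (r - 1)
    # Both numerator and denominator are negated, so the signs cancel.
    return math.expm1(-c_r / u) / (r * math.expm1(-1.0 / u))


def armax_runs_theta(r, u, alpha):
    """theta^R = {P[M_{r-1} <= u] - P[M_r <= u]} / Fbar(u), from (10.5), (10.11).

    Requires r >= 2 so that (10.5) applies to M_{r-1}.
    """
    c_r = 1.0 + (1.0 - alpha) * (r - 1)
    fbar = -math.expm1(-1.0 / u)
    return math.exp(-c_r / u) * math.expm1((1.0 - alpha) / u) / fbar


if __name__ == "__main__":
    alpha, x = 0.5, 2.0
    for n in (10**2, 10**4, 10**6):
        r = math.isqrt(n)           # r_n -> infinity with r_n = o(n)
        u = n * x
        print(n, armax_block_theta(r, u, alpha), armax_runs_theta(r, u, alpha))
    print("theta =", 1.0 - alpha)
```

Both sequences approach $1 - \alpha$, the block version from above and the run version from below in this example.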


Statistical relevance 


Theorem 10.4 shows how the extremal index characterizes the change in the distribution 
of sample maxima due to dependence in the sequence. Suppose $0 < \theta \le 1$. 
If $\tilde G^{\leftarrow}(p)$ is the quantile function for the limit $\tilde G$, then the quantile function for $G$ is 
$G^{\leftarrow}(p) = \tilde G^{\leftarrow}(p^{1/\theta}) \le \tilde G^{\leftarrow}(p)$. This inequality has implications for the estimation 
of quantiles from dependent sequences. 

Suppose that we estimate the parameters $(\mu, \sigma, \gamma)$ of $G$ by fitting, for example, 
an extreme value distribution to a sample of block maxima $M_n$. As before, the normalizing 
constants are assimilated into the location and scale parameters so that 
$P[M_n \le x] \approx \{F(x)\}^{n\theta} \approx G(x)$, the latter being a GEV distribution with parameters 
$(\gamma, \mu, \sigma)$. We can exploit this relationship as in section 5.1.3 to approximate 
marginal quantiles by 

$$F^{\leftarrow}(1 - p) \approx G^{\leftarrow}\{(1 - p)^{n\theta}\} = \begin{cases} \mu + \sigma \dfrac{\{-n\theta \log(1 - p)\}^{-\gamma} - 1}{\gamma}, & \text{if } \gamma \ne 0, \\ \mu - \sigma \log\{-n\theta \log(1 - p)\}, & \text{if } \gamma = 0. \end{cases}$$

If we neglect the extremal index, then we risk underestimating the marginal quantiles. 
Conversely, suppose that we have an estimate of the tail of the marginal 
distribution $F$. Then the $mn$-observation return level is approximated by 

$$G^{\leftarrow}(1 - 1/m) \approx F^{\leftarrow}\{(1 - 1/m)^{1/(n\theta)}\}.$$

If we neglect $\theta$ here, then we risk overestimating the return level. These two 
examples show why it is important to be able to estimate the extremal index. We 
discuss this problem in section 10.3.4, where the different interpretations that we 
have already seen for $\theta$ will motivate different estimators. 
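
The effect of neglecting the extremal index on a marginal quantile can be illustrated with a small computation based on the approximation for $F^{\leftarrow}(1 - p)$ given above. This is a sketch under assumed values: the GEV parameters are the rounded July estimates from section 10.2.2, the block length $n = 31$ is the number of days in July, and $\theta = 0.5$ is a purely hypothetical extremal index chosen for illustration.

```python
import math


def marginal_quantile(p, n, theta, mu, sigma, gamma):
    """Approximate F^{<-}(1 - p) from the GEV parameters of block maxima M_n."""
    y = -n * theta * math.log1p(-p)       # equals -n * theta * log(1 - p)
    if gamma == 0.0:
        return mu - sigma * math.log(y)
    return mu + sigma * (y ** (-gamma) - 1.0) / gamma


if __name__ == "__main__":
    mu, sigma, gamma, n = 30.0, 3.0, -0.34, 31
    p = 0.001
    with_theta = marginal_quantile(p, n, 0.5, mu, sigma, gamma)   # hypothetical theta
    without = marginal_quantile(p, n, 1.0, mu, sigma, gamma)      # theta neglected
    print("with theta = 0.5:", with_theta)
    print("with theta = 1.0:", without)
```

Since the quantile is a decreasing function of $n\theta$, setting $\theta = 1$ when the true value is smaller always gives the lower of the two numbers, which is the underestimation referred to in the text.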

Finally, note that the frequency at which a process is sampled has consequences 
for the distribution of maxima. For example, let $M'_n$ be the maximum from the 
sequence sampled every $m \ge 2$ time steps, with corresponding extremal index 
$\theta_m$. Then 

$$P[M_n \le x] \approx \{F(x)\}^{n\theta} \approx \{P[M'_n \le x]\}^{m\theta/\theta_m}.$$

Robinson and Tawn (2000) develop methods based on this approximation that 
enable inference for the distribution of $M_n$ from data collected at the frequency 
of $M'_n$. 
10.3 Point-Process Models 

In this section, we broaden our outlook from the sample maximum to encompass 
all large values in the sequence, where ‘large’ means exceeding a high threshold. 
A particularly elegant and useful description of threshold exceedances is in terms 
of point processes. We shall see that these models are related to the distribution 
of large order statistics, describe the clustering of extremes and motivate statistical 
methods for the analysis of stationary processes at extreme levels. A brief and 
informal introduction to point processes is given in section 5.9.2; more detailed 
introductions focusing on relevant aspects for extreme value theory may be found 
in the appendix of Leadbetter et al. (1983), in Chapter 3 of Resnick (1987) and in 
Chapter 5 of Embrechts et al. (1997). 


10.3.1 Clusters of extreme values 

Let us seek the limit of the point process 

$$N_n(\cdot) = \sum_{i \in \mathcal{I}} 1(i/n \in \cdot), \quad \mathcal{I} = \{i : X_i > u_n, 1 \le i \le n\}, \qquad (10.12)$$

which counts the times, normalized by $n$, at which the sample $\{X_i\}_{i=1}^{n}$ exceeds a 
threshold $u_n$. This process is related to order statistics by the relationship 

$$P[X_{n-k,n} \le u_n] = P[N_n((0, 1]) \le k]. \qquad (10.13)$$

If we can find the limit process of $N_n$, then we shall be able to derive the limiting 
distribution of the large order statistics. 

Let the thresholds $u_n$ be such that the expected number of exceedances remains 
finite, with $n \bar F(u_n) \to \tau \in (0, \infty)$, and reconsider the partition (10.2) of $\{1, \ldots, n\}$ 
into $k_n = \lfloor n/r_n \rfloor$ blocks $J_j$ of length $r_n = o(n)$. The exceedances in a block are 
said to form a cluster. Now, because of the time normalization in $N_n$, the length 
of a block, $r_n/n$, converges to zero as $n \to \infty$, so that points in $N_n$ making up 
a cluster converge to a single point in $(0, 1]$. In the limit, therefore, the points in 
$N_n$ represent the positions of clusters and form a marked point process with marks 
equal to the number of exceedances in the cluster. 

The distribution of the cluster size in $N_n$ is given by 

$$\pi_n(j) = P\Big[\sum_{i=1}^{r_n} 1(X_i > u_n) = j \,\Big|\, M_{r_n} > u_n\Big], \quad j = 1, 2, \ldots, \qquad (10.14)$$

and the mark distribution of the limit process, if it exists, will be $\pi = \lim \pi_n$. Recall 
that the events $\{M(J_j) \le u_n\}$ are approximately independent under $D(u_n)$, Condition 10.1. 
If we can say the same for the random variables $1\{M(J_j) \le u_n\}$, then 
the number of clusters occurring in $N_n$ during an interval $I \subset (0, 1]$ of length $|I|$ 
is approximately binomial, with probability $p_n = P[M_{r_n} > u_n]$ and mean $p_n k_n |I|$. 
If the process also has extremal index $\theta > 0$, then by (10.6), $p_n \sim \theta r_n \bar F(u_n) \to 0$ 
and $p_n k_n \to \theta\tau > 0$ as $n \to \infty$. Therefore, the number of clusters in $I$ approaches 
a Poisson random variable with mean $\theta\tau |I|$. We might expect clusters to form a 
Poisson process with rate $\theta\tau$ and $N_n$ to converge to a compound Poisson process 
$CP(\theta\tau, \pi)$. 
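
The block definition of clusters translates directly into an empirical counterpart of (10.14). The following sketch groups a series into consecutive blocks of length `r`, records the number of exceedances in each block that has at least one, and returns the empirical cluster-size distribution together with the ratio of clusters to exceedances, which by (10.10) is a natural sample analogue of $\theta$. The function names are hypothetical, and the estimators actually used in this chapter are those of section 10.3.4.

```python
from collections import Counter


def block_cluster_sizes(x, u, r):
    """Number of exceedances of u in each complete length-r block with at least one."""
    sizes = []
    for start in range(0, len(x) - r + 1, r):     # complete blocks only
        count = sum(1 for v in x[start:start + r] if v > u)
        if count > 0:
            sizes.append(count)
    return sizes


def cluster_summary(x, u, r):
    """Empirical cluster-size distribution and the clusters-to-exceedances ratio."""
    sizes = block_cluster_sizes(x, u, r)
    if not sizes:
        raise ValueError("no exceedances of the threshold")
    pi_hat = {j: c / len(sizes) for j, c in sorted(Counter(sizes).items())}
    theta_hat = len(sizes) / sum(sizes)           # reciprocal of mean cluster size
    return pi_hat, theta_hat


if __name__ == "__main__":
    series = [0, 5, 6, 0, 0, 0, 7, 0, 0]          # toy data, threshold 4, blocks of 3
    print(cluster_summary(series, u=4, r=3))
```

In the toy series the first block holds two exceedances and the third holds one, so the two clusters have sizes 2 and 1.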

Convergence in distribution of $N_n$ to a $CP(\theta\tau, \pi)$ process $N$ is equivalent 
to convergence in distribution, for all integer $m \ge 1$ and disjoint 
intervals $I_1, \ldots, I_m \subset (0, 1]$, of the random vector $(N_n(I_1), \ldots, N_n(I_m))$ to 
$(N(I_1), \ldots, N(I_m))$. A convenient way to check the latter is by proving convergence 
of Laplace transforms, that is, by showing that 

$$\mathcal{L}_n(t_1, \ldots, t_m) = E \exp\Big\{-\sum_{i=1}^{m} t_i N_n(I_i)\Big\} \qquad (10.15)$$

converges for all $0 \le t_i < \infty$ ($i = 1, \ldots, m$) to 

$$\mathcal{L}(t_1, \ldots, t_m) = \exp\Big[-\theta\tau \sum_{i=1}^{m} |I_i| \Big\{1 - \sum_{j \ge 1} \pi(j) e^{-t_i j}\Big\}\Big]. \qquad (10.16)$$

The limiting factorization of $\mathcal{L}_n$ is achieved in much the same way as the factorization 
(10.3) of $P[M_n \le u_n]$ except that a mixing condition stronger than $D(u_n)$ 
is required (Hsing et al. 1988). Let $\mathcal{F}_{j,k}(u_n) = \sigma(\{X_i > u_n\} : j \le i \le k)$. 

Condition 10.10 $\Delta(u_n)$. For all $A_1 \in \mathcal{F}_{1,l}(u_n)$, $A_2 \in \mathcal{F}_{l+s,n}(u_n)$ and 
$1 \le l \le n - s$, 

$$|P(A_1 \cap A_2) - P(A_1) P(A_2)| \le \alpha(n, s)$$

and $\alpha(n, s_n) \to 0$ as $n \to \infty$ for some $s_n = o(n)$. 

The $\Delta(u_n)$ condition is more stringent than the $D(u_n)$ condition only in the number 
of events for which the long-range independence is required to hold; it is still 
weaker than strong mixing, for example. Lemma 2.1 of Hsing et al. (1988) tells us 
that we also have, for $1 \le l \le n - s$, $\sup |E(B_1 B_2) - E(B_1) E(B_2)| \le 4\alpha(n, s)$, 
where the supremum is over all random variables $0 \le B_1 \le 1$ measurable with 
respect to $\mathcal{F}_{1,l}(u_n)$ and $0 \le B_2 \le 1$ measurable with respect to $\mathcal{F}_{l+s,n}(u_n)$. This is 
precisely what we need to handle the Laplace transform (10.15). 

Fix an interval $I \subset (0, 1]$ with positive length $|I|$. Let $(r_n)_n$ be a sequence 
of positive numbers such that $r_n/n \to 0$ as $n \to \infty$. Consider the partitioning 
$I = \bigcup_{i=1}^{m_n+1} J_i$ of $I$ into disjoint, contiguous intervals $J_i$ with lengths $|J_i| = r_n/n$ 
for $i = 1, \ldots, m_n$ and $|J_{m_n+1}| < r_n/n$. In particular, $m_n \sim (n/r_n)|I|$. Now, assume 
there exists a sequence $(s_n)_n$ of positive numbers such that $s_n = o(r_n)$ and 
$n \alpha(n, s_n) = o(r_n)$ as $n \to \infty$. Repeating the block-clipping technique that led to 
Theorem 10.2 yields 

$$E \exp\{-t N_n(I)\} = [E \exp\{-t N_n(J_1)\}]^{(n/r_n)|I|} + o(1), \quad n \to \infty.$$





Repeating a similar procedure for the Laplace transform (10.15), we obtain

$$\mathcal{L}_n(t_1, \ldots, t_m) = \prod_{i=1}^m [E \exp\{-t_i N_n(J_1)\}]^{(n/r_n)|I_i|} + o(1), \qquad n \to \infty.$$

It remains to check that each term in the product converges to the corresponding
factor in the Laplace transform (10.16). If $\pi_n(j) \to \pi(j)$ for each integer $j \ge 1$,
then the desired convergence is a consequence of

$$\begin{aligned}
E \exp\{-t N_n(J_1)\} &= P[M_{r_n} \le u_n] + \sum_{j \ge 1} \pi_n(j) P[M_{r_n} > u_n] e^{-jt} \\
&= 1 - (r_n/n)\,\theta\tau \left\{1 - \sum_{j \ge 1} \pi(j) e^{-jt} + o(1)\right\}.
\end{aligned}$$

Theorem 10.11 (Hsing et al. 1988) Let $\{X_i\}$ be stationary with extremal index
$\theta > 0$. Let there exist a sequence of thresholds $u_n$ for which $\Delta(u_n)$ holds and
$n\bar F(u_n) \to \tau \in (0, \infty)$. Let there exist positive sequences $s_n$ and $r_n$ and a distribution
$\pi$ such that $s_n = o(r_n)$, $r_n = o(n)$, $n\alpha(n, s_n) = o(r_n)$ and $\pi_n(j) \to \pi(j)$ for
all integer $j \ge 1$ as $n \to \infty$. Then $N_n \xrightarrow{D} N$, where $N$ is CP($\theta\tau$, $\pi$).



A similar result was also obtained by Rootzén (1988). The rate of convergence
for $N_n$ and other point processes presented in this section has been investigated by
Barbour et al. (2002) and Novak (2003) among others, where bounds are given for
metrics such as total variation distance.

Theorem 10.11 tells us that $\theta\tau$ clusters occur in $(0, 1]$ on average and that the
cluster sizes are independent with distribution $\pi$. Since the expected number of
exceedances in $(0, 1]$ is $\tau$, this means that the average cluster size should be $1/\theta$.
This was noted by Leadbetter (1983) and follows from our definition (10.10) of
$\theta$ since

$$\theta^{-1} = \lim_{n \to \infty} E\left[\sum_{i=1}^{r_n} 1(X_i > u_n) \,\middle|\, M_{r_n} > u_n\right] = \lim_{n \to \infty} \sum_{j \ge 1} j\,\pi_n(j). \tag{10.17}$$

By Fatou's lemma, we have $\theta^{-1} \ge \sum_{j \ge 1} j\,\pi(j)$, the mean of the limiting cluster
size distribution. Smith (1988) shows by counterexample that $\theta^{-1} = \sum_{j \ge 1} j\,\pi(j)$
need not hold, although Hsing et al. (1988) give mild extra assumptions under which
this is actually true. Note also that $\pi(1) = 1$ if $\theta = 1$.


Example 10.12 The cluster-size distribution of the ARMAX process (10.4) may
be found intuitively as follows. Let $X_1 > u_n$ be the first exceedance in a block.
Subsequent values in the sequence will be $\alpha X_1, \alpha^2 X_1, \ldots$ with high probability,
and the probability of observing another such run in the same block is negligible.
With high probability, the number of exceedances in a block will therefore be $j$
provided $\alpha^j X_1 \le u_n < \alpha^{j-1} X_1$. Hence

$$\begin{aligned}
\pi_n(j) &= P[\alpha^j X_1 \le u_n < \alpha^{j-1} X_1 \mid X_1 > u_n] + o(1) \\
&= \frac{\exp(-\alpha^j/u_n) - \exp(-\alpha^{j-1}/u_n)}{1 - \exp(-1/u_n)} + o(1) \\
&\to (1 - \alpha)\alpha^{j-1}, \qquad n \to \infty,
\end{aligned}$$

that is, the limiting cluster-size distribution is geometric with mean
$(1 - \alpha)^{-1} = \theta^{-1}$.
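The convergence in Example 10.12 can be checked numerically. The sketch below evaluates the finite-threshold ratio for increasing thresholds and compares it with the geometric limit. It assumes unit Fréchet margins, $P[X \le x] = \exp(-1/x)$, which is what the displayed ratio implies; the function names and the value $\alpha = 0.5$ are illustrative choices, not part of the text.

```python
import math

def armax_pi_n(j, alpha, u):
    """Finite-threshold cluster-size probability from Example 10.12,
    assuming unit Frechet margins P[X <= x] = exp(-1/x)."""
    # expm1 keeps precision when u is large and the exponents are tiny
    num = math.expm1(-alpha ** j / u) - math.expm1(-alpha ** (j - 1) / u)
    den = -math.expm1(-1.0 / u)
    return num / den

def geometric_pi(j, alpha):
    """Limiting cluster-size probability (1 - alpha) * alpha**(j - 1)."""
    return (1.0 - alpha) * alpha ** (j - 1)

alpha = 0.5  # illustrative ARMAX parameter, so theta = 1 - alpha = 0.5
for u in (10.0, 100.0, 1000.0):
    gap = max(abs(armax_pi_n(j, alpha, u) - geometric_pi(j, alpha))
              for j in range(1, 11))
    print(f"u = {u:7.1f}: max |pi_n(j) - pi(j)| for j <= 10 is {gap:.2e}")

# Mean of the limiting distribution, which should be close to 1 / theta
mean_size = sum(j * geometric_pi(j, alpha) for j in range(1, 200))
print("mean limiting cluster size:", mean_size)
```

The gap shrinks roughly in proportion to $1/u$, and the mean of the geometric limit agrees with $1/\theta$.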

Order statistics 

Relation (10.13) allows us to derive from Theorem 10.11 the limiting distribution
of order statistics; see Hsing et al. (1988) and Hsing (1988), for example. First,
for blocks $J_j$ in (10.2), let $N_n^*$ be the point process of cluster positions,

$$N_n^*(\cdot) = \sum_{j \in I} \varepsilon_{j r_n/n}(\cdot), \qquad I = \{j : M(J_j) > u_n,\ 1 \le j \le k_n\}, \tag{10.18}$$

and let $P[M_n \le u_n] \to G(x) = \exp(-\theta\tau)$. It follows from Theorem 10.11 that
$N_n^* \xrightarrow{D} N^*$, where $N^*$ is a Poisson process on $(0, 1]$ with rate $\theta\tau = -\log G(x)$. If
$K_1, K_2, \ldots$ are independent random variables with distribution $\pi$, then the limit
of $P[X_{n-k,n} \le u_n]$ is

$$\begin{aligned}
P[N((0, 1]) \le k] &= P[N^*((0, 1]) = 0] + \sum_{j=1}^{k} P[N^*((0, 1]) = j]\, P\left[\sum_{l=1}^{j} K_l \le k\right] \\
&= G(x)\left\{1 + \sum_{j=1}^{k} \sum_{i=j}^{k} \frac{\{-\log G(x)\}^j}{j!}\, P\left[\sum_{l=1}^{j} K_l = i\right]\right\}.
\end{aligned} \tag{10.19}$$

For example,

$$P[X_{n-1,n} \le u_n] \to G(x)\{1 - \pi(1) \log G(x)\},$$

$$P[X_{n-2,n} \le u_n] \to G(x)\left[1 - \{\pi(1) + \pi(2)\} \log G(x) + \tfrac{1}{2}\{\pi(1) \log G(x)\}^2\right].$$

Setting $\pi(1) = 1$ and $\pi(j) = 0$ for $j \ge 2$ yields the limit distributions for the
associated, independent sequence as in section 3.2.
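Formula (10.19) is straightforward to evaluate for a given cluster-size distribution. The following sketch computes the limit by repeated convolution of $\pi$ with itself; the function names, the dictionary representation of $\pi$ and the numerical values are illustrative assumptions.

```python
import math

def convolve(p, q, kmax):
    """pmf of the sum of two independent non-negative integer variables,
    keeping only the values 0..kmax."""
    out = [0.0] * (kmax + 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            if a + b <= kmax:
                out[a + b] += pa * qb
    return out

def order_stat_limit(k, G, pi):
    """Limit of P[X_{n-k,n} <= u_n] following (10.19).

    G  : limiting value of P[M_n <= u_n], with 0 < G < 1
    pi : dict mapping cluster size j >= 1 to its probability
    """
    nu = -math.log(G)                      # Poisson rate theta * tau
    single = [0.0] * (k + 1)
    for size, prob in pi.items():
        if 1 <= size <= k:                 # larger clusters cannot give a sum <= k
            single[size] = prob
    conv = [1.0] + [0.0] * k               # pmf of the empty sum
    total = 1.0                            # term for zero clusters
    for j in range(1, k + 1):
        conv = convolve(conv, single, k)   # pmf of K_1 + ... + K_j on 0..k
        total += nu ** j / math.factorial(j) * sum(conv)
    return G * total

pi = {1: 0.5, 2: 0.3, 3: 0.2}              # illustrative cluster-size distribution
G = 0.6
for k in range(4):
    print(k, order_stat_limit(k, G, pi))
```

For $k = 1$ and $k = 2$ the output can be compared with the two closed-form examples above, and with $\pi(1) = 1$ the function reduces to the usual Poisson sum for an independent sequence.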

The joint distribution of $X_{n,n}$ and $X_{n-k,n}$ for any $k \ge 1$, and indeed of any
arbitrary set of extreme order statistics, can also be derived (Hsing 1988) although
the class of limit distributions does not admit a finite-dimensional parametrization.
Simpler characterizations are possible if stricter mixing conditions are imposed
(Ferreira 1993).

10.3.2 Cluster statistics

Various properties of a cluster of exceedances may be of interest, such as the 
cluster size, the peak excess, or the sum of all excesses. In this section, we define 
a generic cluster statistic and give a characterization of its distribution that will be 
useful in section 10.4. We shall investigate point processes that focus on specific 
cluster statistics in the next section. 

We study cluster statistics $c\{(X_i - u_n)_{i=1}^{r_n}\}$ for the following family of functions $c$.

Definition 10.13 (Yun 2000a) A measurable map $c : \mathbb{R} \cup \mathbb{R}^2 \cup \mathbb{R}^3 \cup \cdots \to \mathbb{R}$ is
called a cluster functional if for all integers $1 \le j \le k \le r$ and for all
$(x_1, \ldots, x_r)$ such that $x_i \le 0$ whenever $i = 1, \ldots, j - 1$ or $i = k + 1, \ldots, r$ we
have $c(x_1, \ldots, x_r) = c(x_j, \ldots, x_k)$.


Example 10.14 Most cluster functionals of practical interest are of the form

$$c(x_1, \ldots, x_r) = \sum_{i=-m+2}^{r} \phi(x_i, \ldots, x_{i+m-1}),$$

where $\phi$ is a measurable function of $m$ variables ($m = 1, 2, \ldots$) and $x_i = 0$ whenever
$i \le 0$ or $i \ge r + 1$; the function $\phi$ should be such that $\phi(x_1, \ldots, x_m) = 0$
whenever $x_i \le 0$ for all $i = 1, \ldots, m$. Consider the following examples:

• $m = 1$ and $\phi(x) = 1(x > 0)$ gives the number of exceedances;

• $m = 1$ and $\phi(x) = \max(x, 0)$ gives the sum of all excesses;

• $m = 2$ and $\phi(x_1, x_2) = 1(x_1 \le 0 < x_2)$ gives the number of up-crossings
over the threshold;

• $m = 1, 2, \ldots$ and $\phi(x_1, \ldots, x_m) = 1(x_1 > 0, \ldots, x_m > 0)$ gives the number
of times, counting overlaps, there are $m$ consecutive exceedances.

A cluster functional that is not of this type is the cluster duration

$$c(x_1, \ldots, x_r) = \begin{cases} \max\{j - i + 1 : 1 \le i \le j \le r,\ x_i > 0,\ x_j > 0\} & \text{if } \max_i x_i > 0, \\ 0 & \text{otherwise.} \end{cases}$$
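The cluster functionals of Example 10.14 translate directly into code. In the sketch below, `sliding_functional` builds $c$ from a window function $\phi$ by padding the block with $m - 1$ zeros on each side, which reproduces the summation range $i = -m + 2, \ldots, r$; the helper names and the sample block of excesses are illustrative.

```python
def sliding_functional(phi, m):
    """Return c(x_1..x_r) = sum_{i=-m+2}^{r} phi(x_i, ..., x_{i+m-1}),
    with x_i = 0 outside 1..r."""
    def c(xs):
        padded = [0.0] * (m - 1) + list(xs) + [0.0] * (m - 1)
        return sum(phi(*padded[i:i + m]) for i in range(len(padded) - m + 1))
    return c

# The four examples of the form above (arguments are excesses X_i - u)
cluster_size = sliding_functional(lambda x: 1 if x > 0 else 0, 1)
sum_of_excesses = sliding_functional(lambda x: max(x, 0.0), 1)
upcrossings = sliding_functional(lambda a, b: 1 if a <= 0 < b else 0, 2)

def consecutive_runs(m):
    """Number of windows of m consecutive exceedances, counting overlaps."""
    return sliding_functional(lambda *w: 1 if all(v > 0 for v in w) else 0, m)

def cluster_duration(xs):
    """Time span between the first and last exceedance, 0 if there is none."""
    idx = [i for i, x in enumerate(xs) if x > 0]
    return idx[-1] - idx[0] + 1 if idx else 0

excesses = [-1.0, 0.5, 2.0, -0.3, 1.5, -2.0]   # illustrative block of X_i - u
print(cluster_size(excesses), sum_of_excesses(excesses),
      upcrossings(excesses), consecutive_runs(2)(excesses),
      cluster_duration(excesses))
```

Trimming non-positive values from either end of `excesses` leaves every one of these statistics unchanged, which is the defining property in Definition 10.13.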

For general stationary processes, it turns out that the distribution of a cluster 
statistic can approximately be written in terms of the distribution of the process 
conditionally on the event that the first variable exceeds the threshold. 


Proposition 10.15 (Segers 2003b) Let $\{X_i\}$ be stationary. If the thresholds $u_n$ and
the positive integers $r_n$ are such that Condition 10.8 holds, then, for every sequence
of cluster functionals $c_n$ and Borel sets $A_n \subset \mathbb{R}$,

$$\begin{aligned}
P[c_n\{(X_i - u_n)_{i=1}^{r_n}\} \in A_n \mid M_{r_n} > u_n]
&= \theta_n^{-1} \big\{ P[c_n\{(X_i - u_n)_{i=1}^{r_n}\} \in A_n \mid X_1 > u_n] \\
&\qquad - P[c_n\{(X_i - u_n)_{i=2}^{r_n}\} \in A_n \mid X_1 > u_n] \big\} + o(1)
\end{aligned} \tag{10.20}$$

as $n \to \infty$, where $\theta_n$ can be either $\theta_n^R(u_n)$ or $\theta_n^B(u_n)$.


Specifying the $c_n$ and $A_n$ in (10.20) leads to interesting formulae, illustrating the
usefulness of Proposition 10.15. For instance, with $c_n(x_1, \ldots, x_r) = \sum_{i=1}^{r} 1(x_i > 0)$
and $A_n = [j, \infty)$ for some integer $j \ge 1$, we obtain an approximation of the
cluster-size distribution:

$$P\left[\sum_{i=1}^{r_n} 1(X_i > u_n) \ge j \,\middle|\, M_{r_n} > u_n\right] = \theta_n^{-1} P\left[\sum_{i=2}^{r_n} 1(X_i > u_n) = j - 1 \,\middle|\, X_1 > u_n\right] + o(1).$$

This formula can be used to give a formal derivation of the limiting cluster-size
distribution of the ARMAX process (Example 10.12).

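Formula (10.20) also shows that the cluster maximum asymptotically has
the same distribution as an arbitrary exceedance. For, setting $c_n(x_1, \ldots, x_r) = \sum_{i=1}^{r} 1(x_i > a_n x)$
and $A_n = [1, \infty)$, we obtain

$$\begin{aligned}
P\left[\frac{M_{r_n} - u_n}{a_n} > x \,\middle|\, M_{r_n} > u_n\right]
&= \theta_n^{-1} P\left[\frac{X_1 - u_n}{a_n} > x,\ \frac{\max_{2 \le i \le r_n} X_i - u_n}{a_n} \le x \,\middle|\, X_1 > u_n\right] + o(1) \\
&= \frac{\theta_n^R(u_n + a_n x)}{\theta_n^R(u_n)}\, P\left[\frac{X_1 - u_n}{a_n} > x \,\middle|\, X_1 > u_n\right] + o(1).
\end{aligned}$$

Hence, if $\lim \theta_n^R(u_n + a_n x) = \lim \theta_n^R(u_n) = \theta > 0$, then indeed

$$P\left[\frac{M_{r_n} - u_n}{a_n} > x \,\middle|\, M_{r_n} > u_n\right] = P\left[\frac{X_1 - u_n}{a_n} > x \,\middle|\, X_1 > u_n\right] + o(1). \tag{10.21}$$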


This notion is less surprising once it is realized that clusters with large maxima 
tend to contain other large exceedances. 


10.3.3 Excesses over threshold 

We have already seen a point process (10.12) with a limit that involves the cluster
size. This corresponds to the first example of a cluster statistic in the previous
section. The second example, concerning excesses over threshold, motivates the
marked point process

$$Z_n(\cdot) = \sum_{i \in I} \frac{X_i - u_n}{a_n}\, \varepsilon_{i/n}(\cdot), \qquad I = \{i : X_i > u_n,\ 1 \le i \le n\},$$

where each exceedance is marked with its excess. The normalizing constant $a_n$ is
used to ensure a non-degenerate limit for the distribution of the aggregate excess
within a cluster,

$$\pi'_n(x) = P\left[a_n^{-1} \sum_{i=1}^{r_n} (X_i - u_n)_+ \le x \,\middle|\, M_{r_n} > u_n\right]. \tag{10.22}$$

In order to obtain limits for processes based on excesses, we require limiting
independence of $(X_i - u_n)_+$ instead of $1(X_i > u_n)$. Therefore define $\Delta'(u_n)$ to be
the same as Condition 10.10 but with $\mathcal{F}_{j,k}(u_n) = \sigma\{(X_i - u_n)_+ : j \le i \le k\}$ and
write $\alpha'(n, s)$ for the corresponding mixing coefficients.


Theorem 10.16 (Leadbetter 1995) Let $\{X_i\}$ be stationary with extremal index $\theta > 0$.
Let there exist a sequence of thresholds $u_n$ for which $\Delta'(u_n)$ holds and $n\bar F(u_n) \to \tau \in (0, \infty)$.
Let there exist positive integer sequences $s_n$ and $r_n$ and a distribution
$\pi'$ such that $s_n = o(r_n)$, $r_n = o(n)$, $n\alpha'(n, s_n) = o(r_n)$ and $\pi'_n \xrightarrow{D} \pi'$ as
$n \to \infty$. Then $Z_n \xrightarrow{D} Z$, where $Z$ is CP($\theta\tau$, $\pi'$).


The limit process here is the same as that in Theorem 10.11 except that the
mark distribution now describes the cluster excess; the method of proof is also
similar. Results with different marks may be obtained analogously (Rootzén et al.
1998) as long as the appropriate mixing condition holds and the limiting mark
distribution exists. One case is more substantial, that of the excess of just the
cluster maximum, or peak, leading to the marked point process

$$Z_n^*(\cdot) = \sum_{j \in I} \frac{M(J_j) - u_n}{a_n}\, \varepsilon_{j r_n/n}(\cdot), \qquad I = \{j : M(J_j) > u_n,\ 1 \le j \le k_n\},$$

for the blocks $J_j$ in (10.2). The peak-excess distribution is

$$\pi_n^*(x) = P\left[\frac{M_{r_n} - u_n}{a_n} \le x \,\middle|\, M_{r_n} > u_n\right]$$

and, unlike $\pi$ and $\pi'$ above, here we are able to specify the form of $\pi^* = \lim \pi_n^*$
when it exists. If $\theta > 0$, then, by (10.21), we have

$$\pi_n^*(x) = P\left[\frac{X_1 - u_n}{a_n} \le x \,\middle|\, X_1 > u_n\right] + o(1).$$

By Pickands (1975), the domain-of-attraction condition implies that the limit of
the latter distribution is the Generalized Pareto (GP) distribution, that is,

$$\pi_n^*(x) \to \pi^*(x) = 1 - (1 + \gamma x)_+^{-1/\gamma}, \qquad x > 0, \tag{10.23}$$

for a suitable choice of constants $a_n$; see also section 5.3.1.


Theorem 10.17 (Leadbetter 1991) Let $\{X_i\}$ be stationary with extremal index $\theta > 0$.
Let there exist a sequence of thresholds $u_n$ for which $\Delta'(u_n)$ holds and $n\bar F(u_n) \to \tau \in (0, \infty)$.
Let there exist positive integer sequences $r_n$ and $s_n$ such that $s_n = o(r_n)$,
$r_n = o(n)$, and $n\alpha'(n, s_n) = o(r_n)$ as $n \to \infty$. Then $Z_n^* \xrightarrow{D} Z^*$, where $Z^*$ is CP($\theta\tau$,
$\pi^*$) and $\pi^*$ is the GP distribution.


Theorem 10.17 is the mathematical foundation of the so-called peaks-over- 
threshold (POT) method to be discussed in the next section. 

10.3.4 Statistical applications 

We have seen that the behaviour over high thresholds of certain stationary processes
can be described by compound Poisson processes, where events corresponding to
clusters occur at a rate $\nu = \theta\tau$ and the cluster statistics follow a mark distribution
$\pi$. For a realization $\{x_i\}_{i=1}^{n}$, suppose that there are $n_c$ clusters at times $\{t_j^*\}_{j=1}^{n_c}$ in
$(0, 1]$ and with marks $\{y_j^*\}_{j=1}^{n_c}$. We could fit the model by maximizing the likelihood,

$$L(\nu, \pi; t^*, y^*) = e^{-\nu} \nu^{n_c} \prod_{j=1}^{n_c} \pi(y_j^*), \tag{10.24}$$

see, for example, section 4.4 of Snyder and Miller (1991). The form of the likelihood
means that $\hat\nu = n_c$ independently of $\pi$. If we have a parametric model
for $\pi$, then its maximum-likelihood estimate can be found, and it depends on the
marks only. But the asymptotic theory specifies $\pi$ only when the mark is the peak
excess (Theorem 10.17), in which case $\pi$ is the GP distribution. For other cluster
statistics, we can either choose a parametric model for $\pi$ or estimate it with the
empirical distribution function of $\{y_j^*\}_{j=1}^{n_c}$.
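The statement about the rate estimate follows in one line from (10.24), because the rate and the mark distribution enter the log-likelihood through separate terms. As a brief worked step, writing $\nu$ for the rate and $\pi$ for the mark distribution:

```latex
\log L(\nu, \pi; t^*, y^*)
  = -\nu + n_c \log \nu + \sum_{j=1}^{n_c} \log \pi(y_j^*),
\qquad
\frac{\partial \log L}{\partial \nu} = -1 + \frac{n_c}{\nu} = 0
\;\Longrightarrow\; \hat\nu = n_c .
```

The second derivative, $-n_c/\nu^2$, is negative, so this stationary point is a maximum whatever model is chosen for $\pi$.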

Estimating $\nu$ and $\pi$ relies on being able to identify clusters in the data. This
problem, known as declustering, is not trivial because we observe only a finite
sequence, and so clusters will not be defined at single points in time; rather, they
will be slightly spread out and it may not always be clear whether a group of
exceedances should form a single cluster or be split into separate clusters. Declustering
is intrinsically linked to the extremal index, which we have seen is important
also for its influence on marginal tail quantiles and return levels (section 10.2.3)
and for its interpretation as the inverse mean cluster size (section 10.3.1). We
continue this section by first discussing estimators for the extremal index and
then exploring the connection with declustering before returning to estimation
of the compound Poisson process. An alternative method for estimating cluster
characteristics, which does not use the compound Poisson model, is described in
section 10.4, where the evolution of the process over high thresholds is modelled
by a class of Markov chains.


Estimating the extremal index 


Our first characterization (10.10) of the extremal index was as the limiting ratio
of $P[M_{r_n} > u_n]$ to $r_n \bar F(u_n)$. If we choose a threshold $u$ and a block length $r$,
then natural estimators for the quantities $P[M_r > u]$ and $\bar F(u)$ lead to the blocks
estimator for the extremal index:

$$\hat\theta_n^B(u; r) = \frac{k^{-1} \sum_{j=1}^{k} 1(M_{(j-1)r,\,jr} > u)}{r n^{-1} \sum_{i=1}^{n} 1(X_i > u)}, \tag{10.25}$$

where $k = \lfloor n/r \rfloor$ and, as before, $M_{i,j} = \max\{X_{i+1}, \ldots, X_j\}$. This can be improved
by permitting overlapping blocks, giving the sliding-blocks estimator,

$$\tilde\theta_n^B(u; r) = \frac{(n - r + 1)^{-1} \sum_{i=0}^{n-r} 1(M_{i,i+r} > u)}{r n^{-1} \sum_{i=1}^{n} 1(X_i > u)}.$$

Our second characterization (10.11) was in terms of the probability that an
exceedance is followed by a run of observations below the threshold. If we choose
a threshold $u$ and a run length $r$, then we can estimate the quantities
$P[X_1 > u, M_{1,1+r} \le u]$ and $\bar F(u)$ to obtain the runs estimator for the extremal index:

$$\hat\theta_n^R(u; r) = \frac{(n - r)^{-1} \sum_{i=1}^{n-r} 1(X_i > u,\ M_{i,i+r} \le u)}{n^{-1} \sum_{i=1}^{n} 1(X_i > u)}. \tag{10.26}$$
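The three estimators above are simple counting ratios, and a short implementation may make the index ranges concrete. This is a minimal sketch in plain Python with zero-based indexing, so `x[i]` plays the role of $X_{i+1}$; the function names and the toy series are illustrative, and the block maximum over $X_{i+1}, \ldots, X_{i+r}$ becomes `max(x[i:i + r])`.

```python
def _n_exceedances(x, u):
    n_exc = sum(1 for v in x if v > u)
    if n_exc == 0:
        raise ValueError("no exceedances of the threshold")
    return n_exc

def blocks_estimator(x, u, r):
    """Blocks estimator (10.25): disjoint blocks of length r."""
    n, k = len(x), len(x) // r
    hits = sum(1 for j in range(k) if max(x[j * r:(j + 1) * r]) > u)
    return (hits / k) / (r * _n_exceedances(x, u) / n)

def sliding_blocks_estimator(x, u, r):
    """Sliding-blocks variant: all n - r + 1 overlapping blocks of length r."""
    n = len(x)
    hits = sum(1 for i in range(n - r + 1) if max(x[i:i + r]) > u)
    return (hits / (n - r + 1)) / (r * _n_exceedances(x, u) / n)

def runs_estimator(x, u, r):
    """Runs estimator (10.26): exceedances followed by r non-exceedances."""
    n = len(x)
    hits = sum(1 for i in range(n - r)
               if x[i] > u and all(v <= u for v in x[i + 1:i + r + 1]))
    return (hits / (n - r)) / (_n_exceedances(x, u) / n)

# Toy series with one cluster of two exceedances and one isolated exceedance
x = [0, 5, 6, 0, 0, 0, 0, 7, 0, 0, 0, 0]
print(blocks_estimator(x, u=4, r=3),
      sliding_blocks_estimator(x, u=4, r=3),
      runs_estimator(x, u=4, r=2))
```

On real data the estimates would be computed over a range of thresholds and values of $r$, since both are tuning choices.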


The extremal index is also related to the times between threshold exceedances.
We saw in Theorem 10.11 that the point process of exceedance times normalized
by $1/n$ has a compound Poisson limit. Therefore, the corresponding times between
consecutive exceedances are either zero, representing times between exceedances
within the same cluster, or exponential with rate $\theta\tau$, representing times between
exceedances in different clusters. Since we expect $\tau = \lim n\bar F(u_n)$ exceedances in
total but only $\theta\tau$ clusters, the proportion of interexceedance times that are zero
should be $1 - \theta$.

Formally, for $u$ such that $F(u) < 1$, define the random variable $T(u)$ to be the
time between successive exceedances of $u$, that is,

$$P[T(u) > r] = P[M_{1,1+r} \le u \mid X_1 > u].$$

Ferro and Segers (2003) showed that, under a slightly stricter mixing condition
than $D(u_n)$, for $t > 0$,

$$\begin{aligned}
P[\bar F(u_n) T(u_n) > t] &= P[M_{1,1+\lfloor t/\bar F(u_n) \rfloor} \le u_n \mid X_1 > u_n] \\
&= P[M_{1,r_n} \le u_n \mid X_1 > u_n]\, P[M_{\lfloor nt/\tau \rfloor} \le u_n] + o(1) \\
&= \theta_n^R(u_n)\, P[M_{r_n} \le u_n]^{k_n t/\tau} + o(1) \\
&\to \theta \exp(-\theta t), \qquad n \to \infty.
\end{aligned} \tag{10.27}$$

In other words, interexceedance times normalized by $\bar F(u_n)$ converge in distribution
to a random variable $T_\theta$ with mass $1 - \theta$ at $t = 0$ and an exponential distribution
with rate $\theta$ on $t > 0$. The reason that the rate is now $\theta$ and not $\theta\tau$ is that we have
normalized by $\bar F(u_n) \sim \tau/n$ instead of $1/n$. In fact, the result also holds under
$D(u_n)$, see Segers (2002).

The coefficient of variation, $\nu$, of a non-negative random variable is defined as
the ratio of its standard deviation to its expectation. For $T_\theta$,

$$1 + \nu^2 = E[T_\theta^2]/\{E[T_\theta]\}^2 = 2/\theta. \tag{10.28}$$

The interexceedance times are overdispersed compared to a Poisson process, that
is, $\nu > 1$ and exceedances occur in clusters in the limit, if and only if $\theta < 1$. The
case of underdispersion ($\nu < 1$), in which exceedances tend to repel one another,
requires long-range dependence and is prevented by the $D(u_n)$ condition.
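The value $2/\theta$ in (10.28) can be verified directly from the mixture form of the limit in (10.27): the limiting variable equals zero with probability $1 - \theta$ and is exponential with rate $\theta$ with probability $\theta$. A short worked step, using the exponential moments $1/\theta$ and $2/\theta^2$:

```latex
E[T_\theta] = (1 - \theta) \cdot 0 + \theta \cdot \frac{1}{\theta} = 1,
\qquad
E[T_\theta^2] = (1 - \theta) \cdot 0 + \theta \cdot \frac{2}{\theta^2} = \frac{2}{\theta},
\qquad
1 + \nu^2 = \frac{E[T_\theta^2]}{\{E[T_\theta]\}^2} = \frac{2}{\theta}.
```

In particular $\nu^2 = 2/\theta - 1 \ge 1$ for $0 < \theta \le 1$, with equality only when $\theta = 1$.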

Suppose that we observe $N = N_u = \sum_{i=1}^{n} 1(X_i > u)$ exceedances of $u$ at
times $1 \le S_1 < \cdots < S_N \le n$. The interexceedance times are $T_i = S_{i+1} - S_i$ for
$i = 1, \ldots, N - 1$. Replacing the theoretical moments of $T_\theta$ in the ratio (10.28)
with their empirical counterparts yields another estimator for the extremal index:

$$\hat\theta_n(u) = \frac{2\left(\sum_{i=1}^{N-1} T_i\right)^2}{(N - 1) \sum_{i=1}^{N-1} T_i^2}.$$

Since the limiting distribution (10.27) models the small interexceedance times as
zero, while the observed interexceedance times are always positive, a bias-adjusted
version,

$$\hat\theta_n^*(u) = \frac{2\left\{\sum_{i=1}^{N-1} (T_i - 1)\right\}^2}{(N - 1) \sum_{i=1}^{N-1} (T_i - 1)(T_i - 2)},$$

is preferable when $\max\{T_i : 1 \le i \le N - 1\} > 2$. Unlike the blocks and runs estimators,
these two estimators are not guaranteed to lie in $[0, 1]$ so that the constraint
must be imposed artificially. Doing so yields the intervals estimator for the extremal
index:

$$\tilde\theta_n(u) = \begin{cases} 1 \wedge \hat\theta_n(u) & \text{if } \max\{T_i : 1 \le i \le N - 1\} \le 2, \\ 1 \wedge \hat\theta_n^*(u) & \text{if } \max\{T_i : 1 \le i \le N - 1\} > 2. \end{cases} \tag{10.29}$$
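The intervals estimator needs only the exceedance times, so it can be written in a few lines. The sketch below follows (10.29) as stated; the function name and the toy input are illustrative.

```python
def intervals_estimator(x, u):
    """Intervals estimator (10.29) of the extremal index at threshold u."""
    times = [i + 1 for i, v in enumerate(x) if v > u]   # exceedance times S_1 < ... < S_N
    N = len(times)
    if N < 2:
        raise ValueError("need at least two exceedances")
    T = [b - a for a, b in zip(times, times[1:])]       # interexceedance times
    if max(T) <= 2:
        num = 2.0 * sum(T) ** 2
        den = (N - 1) * sum(t * t for t in T)
    else:
        # bias-adjusted version; den > 0 because some T_i exceeds 2
        num = 2.0 * sum(t - 1 for t in T) ** 2
        den = (N - 1) * sum((t - 1) * (t - 2) for t in T)
    return min(1.0, num / den)

# Toy series with exceedances at times 2, 3, 8, 9 and 20
x = [0.0] * 20
for s in (2, 3, 8, 9, 20):
    x[s - 1] = 1.0
print(intervals_estimator(x, u=0.5))
```

No block or run length has to be supplied, which is the practical attraction of this estimator.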


The blocks and runs estimators are used by Leadbetter et al. (1989) and Smith
(1989); a variant of the blocks estimator is proposed by Smith and Weissman
(1994). Calculations of asymptotic bias by Smith and Weissman (1994) and Weissman
and Novak (1998) suggest, however, that the runs estimator should be preferred.
Asymptotic normality has been established under appropriate conditions by
Hsing (1993) and Weissman and Novak (1998). The choice of auxiliary parameter,
$r$, for both the blocks and runs estimators is largely arbitrary. It may be guided by
physical reasoning about the likely range of dependence in the underlying process
(Tawn 1988b) or parametric modelling of the evolution of extremes (Davison and
Smith 1990). Alternatively, estimates with different $r$ may be combined (Smith
and Weissman 1994). The attraction of the intervals estimator (Ferro and Segers
2003) is its freedom from any auxiliary parameter.

Still more estimators can be found in the literature. For example, Ancona-Navarrete
and Tawn (2000) derive estimators from Markov models fitted to the
data (see also section 10.4). Gomes (1993) constructs an independent sequence by
randomizing the data and then fits GEV distributions to sample maxima from both
this and the original sequence. Since the parameters (10.7) of the two distributions
are related by the extremal index, an estimator for $\theta$ may be obtained as a combination
of the parameter estimates. A comparative study is made by Ancona-Navarrete
and Tawn (2000).

The estimator of Gomes (1993) has the merit that it does not require the selection
of a threshold, although it does require the selection of a block length to obtain
a sample of maxima $M_n$. Threshold choice is a fundamental issue: the estimators
presented in this section estimate a quantity $\theta(u)$ rather than $\theta = \lim \theta(u)$. Hsing
(1993) considers threshold selection for the runs estimator and proposes an adaptive
scheme to minimize mean square error under a model for the bias. A more
common approach is simply to estimate the extremal index using several high
thresholds and then assume that stability of estimates over a range of thresholds
indicates that the limit has been reached.

Declustering the data 

Recall that to estimate the limiting compound Poisson process, we need to decluster 
the data. Several schemes have been proposed in the literature, three of which relate 
to the blocks, runs and intervals estimators for the extremal index. 

Blocks declustering (Leadbetter et al. 1989) is a natural application of the 
definition of clusters given in section 10.3.1. The data are partitioned into blocks of 
length r and exceedances of a threshold u are assumed to belong to the same cluster 
if they fall within the same block. The number of clusters identified in this way 
is the number of blocks with at least one exceedance. The example in Figure 10.4 
identifies two clusters using block length r = 6. The number of clusters is precisely 
the quantity that appears in the numerator of the blocks estimator (10.25) for the 
extremal index, which is therefore the ratio of the number of clusters to the total 
number of exceedances, that is, the reciprocal of the average size of clusters found 
by blocks declustering. 

The runs estimator (10.26) for the extremal index may also be interpreted as the
ratio of the number of clusters to the number of exceedances, but where clusters
are identified by runs declustering (Smith 1989). With this scheme, exceedances
separated by fewer than $r$ non-exceedances are assumed to belong to the same
cluster; if $r = 0$, then each exceedance forms a separate cluster. In Figure 10.4,
three clusters are identified if the run length is $r = 3$, but only two clusters are
identified if $r = 4$.

Figure 10.4 An illustration of blocks declustering with threshold $u$ and block
length $r = 6$.
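Both declustering rules described above are easy to state as code. The sketch below returns each cluster as a list of zero-based positions of its exceedances; the function names and the toy series are illustrative, and the blocks version discards an incomplete final block, in line with $k = \lfloor n/r \rfloor$ in the blocks estimator.

```python
def blocks_decluster(x, u, r):
    """Group exceedances of u that fall in the same block of length r."""
    clusters = []
    for start in range(0, (len(x) // r) * r, r):
        idx = [i for i in range(start, start + r) if x[i] > u]
        if idx:
            clusters.append(idx)
    return clusters

def runs_decluster(x, u, r):
    """Group exceedances separated by fewer than r non-exceedances."""
    clusters = []
    for i, v in enumerate(x):
        if v <= u:
            continue
        # number of non-exceedances since the previous exceedance
        if clusters and i - clusters[-1][-1] - 1 < r:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters

x = [0, 5, 6, 0, 0, 0, 0, 7, 0, 0, 0, 0]   # toy series, threshold u = 4
for r in (0, 4, 5):
    print("runs, r =", r, runs_decluster(x, 4, r))
print("blocks, r = 6", blocks_decluster(x, 4, 6))
```

In this toy series the two exceedance groups are separated by four non-exceedances, so runs declustering merges them only once $r$ reaches 5. Dividing the number of clusters by the number of exceedances gives the reciprocal mean cluster size discussed in the text.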

As with the corresponding estimators for the extremal index, the troublesome 
issue for blocks and runs declustering is the choice of the auxiliary parameter, 
r. Diagnostic tools for selecting r have been proposed by Ledford and Tawn 
(2003), while the following scheme, intervals declustering (Ferro and Segers 2003), 
provides an alternative solution. 

Recall that a proportion $\theta$ of normalized interexceedance times are non-zero in
the limit (10.27), and that these represent times between clusters. If $\hat\theta$ is an estimate
of the extremal index, then it is natural to take the largest $n_c - 1 = \lfloor (N - 1)\hat\theta \rfloor$ of
the interexceedance times $T_i$, $1 \le i \le N - 1$, to be these intercluster times. This
defines a partition of the remaining interexceedance times into sets of intracluster
times. Note also that, because the point process of exceedance times is compound
Poisson, the intercluster times are independent of one another, and the sets of
intracluster times are independent both of one another and of the intercluster times.
To be precise, if $T_{(n_c)}$ is the $n_c$th largest interexceedance time and $T_{i_j}$ is the $j$th
interexceedance time to exceed $T_{(n_c)}$, then $\{T_{i_j}\}_{j=1}^{n_c - 1}$ is a set of approximately
independent intercluster times. In the case of ties, decrease $n_c$ until $T_{(n_c - 1)}$ is strictly
greater than $T_{(n_c)}$. Let also $\mathcal{T}_j = \{T_{i_{j-1}+1}, \ldots, T_{i_j - 1}\}$, where $i_0 = 0$, $i_{n_c} = N$ and
$\mathcal{T}_j = \emptyset$ if $i_j = i_{j-1} + 1$. Then $\{\mathcal{T}_j\}_{j=1}^{n_c}$ is a collection of approximately independent
sets of intracluster times. Furthermore, each set $\mathcal{T}_j$ has associated with it a set of
threshold exceedances $\mathcal{X}_j = \{X_i : i \in \mathcal{S}_j\}$, where $\mathcal{S}_j = \{S_{i_{j-1}+1}, \ldots, S_{i_j}\}$ is the set
of exceedance times. If we estimate $\theta$ with the intervals estimator (10.29), then this
approach declusters the data into $n_c$ clusters without requiring an arbitrary selection
of auxiliary parameter. In fact, the scheme is equivalent to runs declustering but
with run length $r = T_{(n_c)}$ estimated from the data and justified by the limiting
theory.
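The intervals declustering rule can be sketched as follows. The function takes the sorted exceedance times and an estimate of the extremal index, applies the tie rule described above, and splits the exceedance times wherever an interexceedance time is strictly larger than the implied run length. The function name, the handling of the boundary case in which every gap is an intercluster time, and the toy inputs are assumptions of this sketch.

```python
import math

def intervals_decluster(times, theta):
    """Split sorted exceedance times into clusters.

    Returns (clusters, run_length), where run_length is the n_c-th largest
    interexceedance time after the tie adjustment.
    """
    T = [b - a for a, b in zip(times, times[1:])]     # interexceedance times
    n_c = int(math.floor(len(T) * theta)) + 1         # n_c - 1 = floor((N - 1) * theta)
    ordered = sorted(T, reverse=True)
    # ties: decrease n_c until T_(n_c - 1) is strictly greater than T_(n_c)
    while 1 < n_c <= len(ordered) and ordered[n_c - 2] == ordered[n_c - 1]:
        n_c -= 1
    # if n_c exceeds the number of gaps, every gap separates two clusters
    run_length = ordered[n_c - 1] if n_c <= len(ordered) else 0
    clusters = [[times[0]]]
    for gap, t in zip(T, times[1:]):
        if gap > run_length:
            clusters.append([t])      # intercluster time: start a new cluster
        else:
            clusters[-1].append(t)    # intracluster time: extend the current one
    return clusters, run_length

print(intervals_decluster([2, 3, 8, 9, 20], theta=0.5))
print(intervals_decluster([1, 2, 5, 8, 20], theta=0.5))   # tie between two gaps of 3
```

In the second call the two gaps of length 3 are tied, so the number of clusters is reduced from three to two before the data are split.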
Estimating the compound Poisson process

Once we have identified clusters $\mathcal{X}_j = \{x_i : i \in \mathcal{S}_j\}$ for $j = 1, \ldots, n_c$ over a high
threshold $u$, we can compute the cluster statistics $y_j^* = c\{(x_i - u)_{i \in \mathcal{S}_j}\}$ corresponding
to the marks of the limiting compound Poisson process. We have remarked
already that $\hat\nu = n_c$, while $\pi$ may be estimated by the empirical distribution function
of the cluster statistics, if the theory does not supply a parametric model.

In the case of the peak excess, $\pi$ is the GP distribution (Theorem 10.17) and
may be estimated by maximum likelihood. This is known as POT modelling. Estimation
methods, diagnostics and extensions of the model to handle seasonality and
other regressors are described by Davison and Smith (1990); see also Chapter 7.
An alternative POT approach is to fit the GP distribution to all of the excesses,
not only those of the cluster maxima. The idea is justified by the fact (10.21)
that, in the limit, the distribution of the excess of a cluster maximum is the same
as that of an arbitrary exceedance, although the correspondence is often poor at
finite thresholds. By fitting to all of the excesses, we avoid having to decluster the
exceedances; on the other hand, the excesses can no longer be treated as though
they were independent, which necessitates a modification of the estimation procedure.
One approach is to adopt the estimation methods appropriate when the
excesses are independent and adjust the standard errors, which will otherwise be
underestimated. Several methods for obtaining standard errors in this case have
been proposed: see Smith (1990a), Buishand (1993) and Drees (2000).

For any cluster statistic, a bootstrap scheme (Ferro and Segers 2003) that
exploits the independence structure of the compound Poisson process may be used
to obtain confidence limits on estimates of $\nu$, $\pi$ and derived quantities, $f$, such as
the mean of $\pi$.

(i) Resample with replacement $n_c - 1$ intercluster times from $\{T_{i_j}\}_{j=1}^{n_c - 1}$.

(ii) Resample with replacement $n_c$ sets of intracluster times, some of which may
be empty, and associated exceedances from $\{(\mathcal{T}_j, \mathcal{X}_j)\}_{j=1}^{n_c}$.

(iii) Intercalate these interexceedance times and clusters to form a bootstrap replication
of the process.

(iv) Compute $N$ for the bootstrap process, estimate $\theta$, and decluster accordingly.

(v) Estimate $\nu$, $\pi$ and $f$ for the declustered bootstrap sample.

Forming $B$ such bootstrap samples yields collections of estimates that may be used
to approximate the distributions of the original point estimates. In particular, the
empirical $\alpha$- and $(1 - \alpha)$-quantiles of each collection define $(1 - 2\alpha)$-confidence
intervals. Note that, when applied with intervals declustering, this scheme accounts
for uncertainty in the run length used to decluster the data, as it is re-estimated for
each sequence at step (iv).
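As an illustration of steps (i) to (iv), the sketch below bootstraps the intervals estimator of the extremal index alone. This is a deliberate simplification: because that estimator depends only on interexceedance times, the exceedance values attached to each cluster in step (ii) are not carried along, and step (v) is omitted. The function names, the percentile rule used for the interval, the fixed seed and the toy input are all assumptions of this sketch rather than details taken from Ferro and Segers (2003).

```python
import math
import random

def intervals_theta(T):
    """Intervals estimator (10.29) computed from interexceedance times T."""
    n = len(T)                                   # n = N - 1
    if max(T) <= 2:
        num, den = 2.0 * sum(T) ** 2, n * sum(t * t for t in T)
    else:
        num = 2.0 * sum(t - 1 for t in T) ** 2
        den = n * sum((t - 1) * (t - 2) for t in T)
    return min(1.0, num / den)

def split_times(T, theta):
    """Return (intercluster times, list of intracluster-time sets)."""
    n_c = int(math.floor(len(T) * theta)) + 1
    ordered = sorted(T, reverse=True)
    while 1 < n_c <= len(ordered) and ordered[n_c - 2] == ordered[n_c - 1]:
        n_c -= 1                                 # tie rule of intervals declustering
    run_length = ordered[n_c - 1] if n_c <= len(ordered) else 0
    inter, intra = [], [[]]
    for t in T:
        if t > run_length:
            inter.append(t)
            intra.append([])                     # a new, possibly empty, cluster
        else:
            intra[-1].append(t)
    return inter, intra

def bootstrap_theta(T, B=200, alpha=0.025, seed=0):
    """Percentile interval for the extremal index from B bootstrap replications."""
    rng = random.Random(seed)
    inter, intra = split_times(T, intervals_theta(T))
    n_c = len(intra)
    reps = []
    for _ in range(B):
        new_inter = [rng.choice(inter) for _ in range(n_c - 1)]   # step (i)
        new_intra = [rng.choice(intra) for _ in range(n_c)]       # step (ii)
        T_star = list(new_intra[0])                               # step (iii)
        for gap, cluster in zip(new_inter, new_intra[1:]):
            T_star.append(gap)
            T_star.extend(cluster)
        if T_star:
            reps.append(intervals_theta(T_star))                  # step (iv)
    reps.sort()
    lo = reps[int(math.floor(alpha * (len(reps) - 1)))]
    hi = reps[int(math.ceil((1 - alpha) * (len(reps) - 1)))]
    return lo, hi

T = [1, 5, 1, 11, 1, 1, 30, 2, 1, 15]            # toy interexceedance times
print(intervals_theta(T), bootstrap_theta(T))
```

To bootstrap other cluster statistics, each intracluster set would need to be stored together with its exceedances, and each replication would be declustered again before the statistic is recomputed.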

Alternative confidence limits for the extremal index (Leadbetter et al. 1989)
rely on the asymptotic normality and variance of the blocks estimator, which may
be estimated (Hsing 1991; Smith and Weissman 1994) by

$$\frac{\{\hat\theta_n^B(u; r)\}^3 V_s}{\sum_{i=1}^{n} 1(X_i > u)}, \tag{10.30}$$

where $V_s$ is the sample variance of the cluster sizes,
$\{\sum_{i=(j-1)r+1}^{jr} 1(X_i > u) : M_{(j-1)r,\,jr} > u,\ 1 \le j \le \lfloor n/r \rfloor\}$.

Figure 10.5 The intervals estimator for the extremal index (-o-) against threshold
with 95% confidence intervals estimated by the bootstrap (dotted line) and the
normal approximation (solid line). The threshold is marked on the upper axis in degrees
Celsius.

10.3.5 Data example 

The intervals estimator for the extremal index of the Uccle temperature data (see
section 10.2.2) is plotted against threshold in Figure 10.5. In this and subsequent
plots, thresholds range from the 90% to the 99.5% empirical quantiles, and bootstrapped
confidence intervals are based on the intervals declustering scheme of
section 10.3.4 with 500 resamples. Note that in Figure 10.5, the lower confidence
limits estimated by the bootstrap and the normal approximation (10.30) are similar,
while the upper limits are higher with the bootstrap. The point estimates of the
extremal index are stable up to the 97% threshold, with values just below 0.5. The
increase of the estimates above the 97% threshold might indicate that the limit
has not been reached, and possibly $\theta = 1$, or could be due to sampling variability.
We shall return to this question in section 10.4.7; for now, we assume that the
perceived stability indicates that the limit has been reached and that the limiting
cluster characteristics of the data can be estimated by fixing a suitable threshold.

A cluster of hot days can have serious implications for public health and agriculture.
By declustering the data, we can obtain estimates for the rate at which
clusters occur and for the severity of clusters, which can be usefully measured
with the distributions of statistics such as the cluster maximum, cluster size and
cluster excess. We have seen already that the mean cluster size is $1/\theta \approx 2$.

The intervals declustering scheme, applied with the above estimates of the
extremal index, enables the identification of clusters at different thresholds. The
Poisson process rate at which clusters occur decreases approximately linearly
with the threshold exceedance probability, according to the approximation
$n_c \approx \theta n \bar F(u)$. On average, about 1.3 clusters occur over the 90% quantile every July,
and the rate decreases by about 0.12 for every decrease of 0.01 in the threshold
exceedance probability. Estimates of the declustering run length $r$ are close to 4
for all thresholds, indicating that exceedances separated by about four days for
which the temperature is below the threshold can be taken as independent.

For the POT model, we describe the excesses of cluster maxima by the GP 
distribution (10.23). The maximum-likelihood estimates of the GP parameters at 
different thresholds are represented in Figure 10.6. The model is fitted twice: 



0.91 0.93 0.95 0.97 0.99 0.91 0.93 0.95 0.97 0.99 


Threshold quantile Threshold quantile 

(c) (d) 

Figure 10.6 Parameter estimates ( — o — ) against threshold for the GP distribution 
fitted to cluster maxima (a) and all exceedances (b) with bootstrapped 95% con¬ 
fidence intervals (.). The scale parameters have been normalized to cr — yu. 

The estimate ( - ) of the shape parameter from the GEV model is also indicated, 

and the threshold is marked on the upper axes in degrees Celsius. 
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Generalised pareto quantiles Unit exponential quantiles 

⑻ (b) 

Figure 10.7 Quantile plots for excesses of cluster maxima (a) and for the nor¬ 
malized interexceedance times (b) over the 96% threshold with 95% confidence 
intervals obtained by simulating from the fitted models. For the interexceedance 
times, the continuous line has gradient 1/0 and breakpoint — logO, where 0 = 0.49. 

first to only the cluster maxima, and second, to all exceedances. As there is 
some disparity between the two fits, we should be wary of using the latter to 
model peaks. Note also that the estimate of the shape parameter is close to $-0.5$,
below which the usual asymptotic properties of maximum-likelihood estimators
do not hold (Smith 1985). Moment estimators give similar point estimates, however,
and the bootstrap confidence intervals do not rely on asymptotic normality.
For both fits, the parameter estimates are quite stable, perhaps with some bias
below the 96% threshold, 31°C, at which there are 120 exceedances and 59
identified clusters. The quantile plots in Figure 10.7 show that the GP model is a
satisfactory fit at the 96% threshold and that the interexceedance times at this
threshold are well modelled by their limit distribution (10.27). Furthermore, the
mean-excess plot is approximately linear above the 96% threshold. We take the
fit to cluster maxima at the 96% threshold, $\hat\sigma = 3.7$ and $\hat\gamma = -0.59$, for our POT
model.

The marginal distribution of the temperature data is captured better by the fit
to all exceedances, so we use the corresponding GP parameter estimates,
$\hat\sigma = 2.8$ and $\hat\gamma = -0.42$, to describe the marginal tail. The 99%, 99.9% and 99.99%
marginal quantiles, with bootstrapped 95% confidence intervals, are 33.9 (33.4,
34.3), 36.2 (35.6, 36.6) and 37.1 (36.2, 37.8). Compare the first two with the
empirical quantiles, 33.7 and 36.2. Combining the estimate of the extremal index,
$\hat\theta = 0.49$, at the 96% threshold with this estimate of the GP distribution yields estimates
of the 100, 1000 and 10000 July return levels: 36.5 (35.7, 36.9), 37.2 (36.2, 38.1)
and 37.5 (36.3, 38.7). The confidence intervals are obtained by bootstrapping the
extremal index and GP parameters with the scheme described in section 10.3.4.
The estimate of the upper end-point is 37.7 (36.4, 39.6). These estimates are lower,
and the confidence intervals are narrower than the direct estimates from the GEV 
model of section 10.2.2. Bootstrapped confidence intervals have been preferred 
here to methods relying on asymptotic normality or profile likelihoods because 
they easily account for the dependence between threshold exceedances and for the 
uncertainty in the declustering scheme. 
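
The point estimates quoted above can be reproduced approximately with a short calculation. The following is only a sketch under stated assumptions: the threshold is taken as the rounded value u = 31, the exceedance probability as 0.04, July as a block of 31 days, and the m-July return level as the solution of $F(x)^{n\theta} = 1 - 1/m$ with the GP tail model for $F$. The function names are hypothetical.

```python
def gp_tail_quantile(p_exceed, u, lam, sigma, gamma):
    """Level x > u whose marginal exceedance probability is p_exceed,
    under the tail model P[X > x] = lam * (1 + gamma * (x - u) / sigma)**(-1 / gamma)."""
    return u + (sigma / gamma) * ((p_exceed / lam) ** (-gamma) - 1.0)


def return_level(m_blocks, n_per_block, theta, u, lam, sigma, gamma):
    """Level exceeded by the block maximum once every m_blocks blocks on average,
    assuming P[M_n <= x] is approximately F(x)**(n * theta)."""
    target = 1.0 - 1.0 / m_blocks
    p_exceed = 1.0 - target ** (1.0 / (n_per_block * theta))
    return gp_tail_quantile(p_exceed, u, lam, sigma, gamma)


# Estimates quoted in the text; u = 31.0 is the rounded 96% threshold (assumption).
params = dict(u=31.0, lam=0.04, sigma=2.8, gamma=-0.42)
quantile_99 = gp_tail_quantile(0.01, **params)
levels = [return_level(m, 31, 0.49, **params) for m in (100, 1000, 10000)]
endpoint = params["u"] - params["sigma"] / params["gamma"]  # finite since gamma < 0
```

Under these assumptions the computed values agree with the quoted point estimates to within rounding; the bootstrap intervals are not reproduced here.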

In addition to the cluster maxima, other statistics of interest are the numbers of 
exceedances and the sum of all excesses during a cluster. The empirical estimate 
of the cluster-excess distribution appears later in Figure 10.11. Estimates of the 
cluster-size distribution are presented in Figure 10.8. These appear stable, but again 
there is a hint that $\pi(1) \to 1$ as the threshold increases. The point estimates at the
96% threshold are $\hat\pi(1) = 0.61$, $\hat\pi(2) = 0.15$, $\hat\pi(3) = 0.07$ and $\hat\pi(4) = 0.08$; 8%
of clusters have more than four exceedances. These estimates can be combined
with the GEV model to determine distributions (10.19) of large order statistics 
for July. 

Inspecting the data reveals that clusters tend to comprise only consecutive 
exceedances, maximizing public health and agricultural impacts. This is reflected 
in the distribution, $\kappa$, of the maximum number of consecutive exceedances within a
cluster: at the 96% threshold, the estimate is $\hat\kappa(1) = 0.64$, $\hat\kappa(2) = 0.17$, $\hat\kappa(3) = 0.05$
and $\hat\kappa(4) = 0.08$, which is very similar to the cluster-size distribution. The mean
number of up-crossings per cluster is 1.17, with bootstrapped 95% confidence
interval (1.00, 1.43).
Figure 10.8 Cluster-size distribution estimates (-o-) against threshold for sizes
1 (a) and 2 (b) with bootstrapped 95% confidence intervals. The threshold is marked 
on the upper axes in degrees Celsius. 
10.3.6 Additional topics

Two-dimensional point processes 

In this section, we have considered one-dimensional point processes in which
exceedance times are associated with marks defined by the exceeding random
variables $X_i$. Another instructive approach is to consider two-dimensional processes
recording time in the first dimension and $X_i$ in the second. The process

    $$V_n(\cdot) = \sum_{i=1}^{n} \delta_{(i/n,\,(X_i - b_n)/a_n)}(\cdot)$$

was studied by Hsing (1987), extending work of Pickands (1971) on independent
sequences and Mori (1977) on strong-mixing sequences. When the normalizing
constants are such that $(M_n - b_n)/a_n$ has a GEV limit, $G$, with lower and upper
end-points ${}_*x$ and $x^*$, and the $\Delta(u_n)$ condition holds simultaneously at different
thresholds, Hsing (1987) shows that any limit of $V_n$ has the form

    $$\sum_{i \ge 1} \sum_{j=1}^{K_i} \delta_{(S_i,\,X_{i,j})},$$

where $S_i$ represents the occurrence time of a cluster of points $X_{i,1} \ge \cdots \ge X_{i,K_i}$.

The times and heights, $\{(S_i, X_{i,1})\}_{i \ge 1}$, of cluster maxima occur according to a
two-dimensional, nonhomogeneous Poisson process $\eta$ on $(0, 1] \times ({}_*x, x^*)$ with intensity
measure $-(b - a) \log G(x)$ on $(a, b) \times [x, x^*)$. This corresponds to our discussion
of the process (10.18) of cluster maxima over a single threshold; see also section
5.3.1. Further insight is provided by the relationship between cluster maxima and
the remaining points in a cluster. For each cluster, the points

    $$Y_{i,j} = \frac{-\log G(X_{i,j})}{-\log G(X_{i,1})}, \quad 1 \le j \le K_i,$$

occur according to an arbitrary point process on $[1, \infty)$ with atom $Y_{i,1} = 1$, and
these point processes are independent, identically distributed and independent of $\eta$.
More general normalizations than linear ones are considered in Novak (2002).


Tail array sums 

Sometimes, we are interested in summaries of not just characteristics of individual
clusters but also the cumulative effect of all exceedances over a longer time period.
Useful measures for such cumulative effects are tail array sums (Leadbetter 1995;
Rootzen et al. 1998),

    $$W_n = \sum_{i=1}^{n} \phi(X_i - u_n),$$    (10.31)

for functions $\phi$ satisfying $\phi(x) = 0$ when $x \le 0$ as in section 10.3.2. Note that we
can decompose $W_n$ as

    $$W_n = \sum_{j=1}^{k_n} W_n(J_j),$$

where $J_j$ are the blocks in (10.2) and $W_n(J_j) = \sum_{i \in J_j} \phi(X_i - u_n)$ are the
block sums.

The tail array sum is related by $W_n = \Phi_n(0, 1]$ to the point process

    $$\Phi_n(\cdot) = \sum_{i \in I} \phi(X_i - u_n)\,\delta_{i/n}(\cdot), \quad I = \{i : X_i > u_n,\ 1 \le i \le n\},$$

of which we have seen examples in sections 10.3.1 and 10.3.3. Therefore, whenever
$\Phi_n$ has a compound Poisson limit with mark distribution $\pi$ determined by the
distribution of $W_n(J_1)$ conditional on $M(J_1) > u_n$, $W_n$ will converge in distribution
to $\sum_{j=1}^{N_c} W_j$, where $N_c$ is a Poisson random variable representing the number of
clusters and the $W_j$ are independent random variables with distribution $\pi$. The
compound Poisson model does not provide a finite-parameter characterization for
the limit distribution of $W_n$, except in cases where $\pi$ is known.

Previously, the number of clusters had a Poisson limit because its expectation
was controlled by $n\bar{F}(u_n) \to \tau < \infty$. If, however, the thresholds are such that
$n\bar{F}(u_n) \to \infty$, then we might hope to obtain a central limit theorem for $W_n$ as
the sum of a large number of block sums. To obtain non-degenerate limits, we
normalize using

    $$\sigma_n^2 = k_n \operatorname{var}\{W_n(J_1)\}$$    (10.32)

and restrict the dependence with $\Delta_\phi(u_n)$, defined to be the same condition as
$\Delta(u_n)$ but with $\mathcal{F}_{j,k}(u_n) = \sigma\{\phi(X_i - u_n)_+ : j \le i \le k\}$ and mixing coefficients
$\alpha_\phi(n, s)$. With the usual moment conditions, we obtain the following result.


Theorem 10.18 (Leadbetter 1995) Let there exist a sequence of thresholds $u_n$
for which $\Delta_\phi(u_n)$ holds, $n\bar{F}(u_n) \to \infty$ and $E[\phi^2(X_1 - u_n)] < \infty$. Let there
exist positive integer sequences $r_n$ and $s_n$ such that $s_n = o(r_n)$, $r_n = o(n)$,
and $n\alpha_\phi(n, s_n) = o(r_n)$ as $n \to \infty$ and such that the Lindeberg condition,
$k_n E\{W_{n1}^2 1(|W_{n1}| > \varepsilon)\} \to 0$ as $n \to \infty$ for all $\varepsilon > 0$, holds with
$W_{n1} = [W_n(J_1) - E\{W_n(J_1)\}]/\sigma_n$ and $k_n = \lfloor n/r_n \rfloor$. Then,

    $$\sigma_n^{-1}\{W_n - E(W_n)\} \xrightarrow{D} W,$$

where $W$ has a standard normal distribution.


Theorem 10.18 says that we may model W n by a normal distribution, reducing 
inference to estimation of its mean and variance. The mean may be estimated by 
the observed value of W n and the variance by substituting the sample variance of 
the W n (Jj) into expression (10.32). 
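
The plug-in estimates described above can be sketched as follows. This is an illustration only: the data are synthetic independent exponential variables, the block length and threshold are arbitrary choices, $\phi$ is taken to be the identity on positive excesses (so that $W_n$ is the sum of all excesses), and the function name is hypothetical.

```python
import random
import statistics


def tail_array_sum_summary(x, u, block_length, phi=lambda excess: excess):
    """Return (W_n, variance estimate), where W_n sums phi(x_i - u) over
    exceedances and the variance estimate is k_n times the sample variance
    of the block sums, a plug-in version of (10.32)."""
    def phi_plus(value):
        # phi is only applied to positive excesses; other terms contribute zero
        return phi(value - u) if value > u else 0.0

    k_n = len(x) // block_length
    block_sums = [
        sum(phi_plus(v) for v in x[j * block_length:(j + 1) * block_length])
        for j in range(k_n)
    ]
    w_n = sum(phi_plus(v) for v in x)
    var_hat = k_n * statistics.variance(block_sums)
    return w_n, var_hat


random.seed(1)
data = [random.expovariate(1.0) for _ in range(5000)]  # synthetic series
w_n, var_hat = tail_array_sum_summary(data, u=2.0, block_length=50)
half_width = 1.96 * var_hat ** 0.5
interval = (w_n - half_width, w_n + half_width)  # normal-approximation interval
```

The interval uses the normal approximation of Theorem 10.18; for dependent data the block length would have to be chosen large enough for block sums to be approximately independent.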
10.4 Markov-Chain Models

In the previous sections, we did not make any assumptions at all on the form of
dependence among the variables $X_i$, except for a restriction on long-range
dependence at extreme levels. This generality is of course attractive from a mathematical
point of view, but leaves us with little means to analyse, for instance, the structure
of clusters of high-threshold exceedances except for the usual empirical estimates
obtained after application of a declustering scheme. As we saw earlier, the choice
of such a scheme may be subject to large uncertainty (which was quantified by
our bootstrap scheme) and, moreover, if there are only a few clusters of extremes,
then the empirical estimates are not very informative.

A possible way out of this problem is to make more detailed assumptions 
about the dependence structure in the series, for instance, by assuming some kind 
of (semi-)parametric model. In the present section, we focus on Markov chains 
for which the joint distribution of a pair of consecutive variables satisfies some 
regularity at extreme levels. Other time-series models are considered briefly in 
section 10.6. 

The Markov-chain approach is successful because, under weak assumptions, 
the distribution of the chain given that it started at an extreme level, the so-called
tail chain, can be represented in terms of a certain random walk, while the
extremal index and, more generally, the distribution of clusters of extreme values 
can be written in terms of this tail chain (Perfekt 1994; Smith 1992). Moreover, 
an approximate likelihood can be constructed from which the Markov chain can 
be estimated, and the tail chain subsequently derived, given a set of data (Smith 
et al. 1997). 

10.4.1 The tail chain 

Let $\{X_n\}_{n \ge 1}$ be a stationary Markov chain. We assume that the joint distribution
function $F(x_1, x_2)$ of $(X_1, X_2)$ is absolutely continuous with joint density
$f(x_1, x_2)$. Denote the marginal density of the chain by $f(x)$ and the marginal
distribution function by $F(x)$, and let $x^* = \sup\{x \in \mathbb{R} : F(x) < 1\}$ be its right
end-point. The Markov property entails that for every positive integer $n$, the joint
density of the vector $(X_1, \ldots, X_n)$ is equal to

    $$f(x_1, \ldots, x_n) = f(x_1) \prod_{i=2}^{n} f(x_i \mid x_{i-1}) = \prod_{i=2}^{n} f(x_{i-1}, x_i) \Big/ \prod_{i=2}^{n-1} f(x_i).$$    (10.33)


We shall model the extremes of the chain under the assumption that the joint
distribution of $(X_1, X_2)$ is in the domain of attraction of a bivariate extreme value
distribution $G(x_1, x_2)$. Without loss of generality, we take the identical margins of
$G$ to be the standard extreme value distribution with shape parameter $\gamma \in \mathbb{R}$:

    $$G_\gamma(x) = \exp\{-(1 + \gamma x)^{-1/\gamma}\}, \quad 1 + \gamma x > 0.$$


If the distribution of $(X_1, X_2)$ is in the domain of attraction of $G$, then, by Pickands
(1975) and Marshall and Olkin (1983), there exists a positive function $\sigma(u)$, $u < x^*$,
such that for $x, x_1, x_2$ with $1 + \gamma x > 0$ and $1 + \gamma x_i > 0$ ($i = 1, 2$) we have

    $$\frac{1 - F\{u + \sigma(u)x\}}{1 - F(u)} \to (1 + \gamma x)^{-1/\gamma},$$    (10.34)

    $$\frac{1 - F\{u + \sigma(u)x_1,\ u + \sigma(u)x_2\}}{1 - F(u)} \to V(x_1, x_2),$$    (10.35)

as $u \uparrow x^*$, where $V(x_1, x_2) = -\log G(x_1, x_2)$; see also equation (8.69).

Our model for the extremes of the chain and the methods of inference will
be based on the limiting distribution of the vector $\{(X_i - u)/\sigma(u)\}_{i=1}^{m}$
conditionally on $X_1 > u$, where $m$ is a positive integer. We shall show now that a
non-trivial limit indeed exists provided we strengthen conditions (10.34)-(10.35) to
density convergence.

As a preliminary, we take a closer look at the extreme value distribution $G$.
From section 8.2, we recall the following facts. The function

    $$G_*(z_1, z_2) = G\left(\frac{z_1^\gamma - 1}{\gamma}, \frac{z_2^\gamma - 1}{\gamma}\right), \quad 0 < z_i < \infty \ (i = 1, 2),$$

is a bivariate extreme value distribution with standard Frechet margins, and there
exists a positive measure $H$ on the unit interval $[0, 1]$ so that

    $$V_*(z_1, z_2) = -\log G_*(z_1, z_2) = \int_{[0,1]} \max\{w/z_1,\ (1 - w)/z_2\}\, H(dw).$$    (10.36)

The measure $H$ is called the spectral measure, and it necessarily satisfies the
constraints

    $$\int_{[0,1]} w\, H(dw) = 1 = \int_{[0,1]} (1 - w)\, H(dw).$$    (10.37)

For the sake of simplicity, we make the following assumption. 


Condition 10.19 The spectral measure $H$ is absolutely continuous with continuous
density function $h(w)$ for $0 < w < 1$.

This condition is a genuine restriction. For instance, it excludes the case where
the margins of $G$ are independent, in which $H$ is concentrated on 0 and 1. Some
parametric models, such as the asymmetric logistic (Tawn 1988a) in Example 8.1,
also allow $H$ to have non-zero mass at 0 and 1. The arguments below can be
extended to cover these cases as well (Perfekt 1994; Yun 1998).
Under Condition 10.19, the function $V_*$ is twice differentiable, and, denoting
partial derivatives by appropriate subscripts, we have by equation (8.36)

    $$V_{*12}(z_1, z_2) = -(z_1 + z_2)^{-3}\, h\{z_1/(z_1 + z_2)\}$$    (10.38)

for $0 < z_i < \infty$ ($i = 1, 2$). Since, for $(x_1, x_2)$ such that $1 + \gamma x_i > 0$ ($i = 1, 2$),

    $$V(x_1, x_2) = V_*(z_1, z_2), \quad z_i = (1 + \gamma x_i)^{1/\gamma} \ (i = 1, 2),$$    (10.39)

the function $V$ is twice differentiable too, and we can formulate an assumption
extending conditions (10.34)-(10.35) to densities.


Condition 10.20 The function $V$ is twice differentiable, and for $x, x_1, x_2$ such that
$1 + \gamma x > 0$ and $1 + \gamma x_i > 0$ ($i = 1, 2$) we have, as $u \uparrow x^*$,

    $$\frac{\sigma(u) f\{u + \sigma(u)x\}}{1 - F(u)} \to (1 + \gamma x)^{-1/\gamma - 1},$$

    $$\frac{\sigma(u)^2 f\{u + \sigma(u)x_1,\ u + \sigma(u)x_2\}}{1 - F(u)} \to -V_{12}(x_1, x_2).$$


Under Condition 10.20, we can find the limit of the joint density of the vector
$\{(X_i - u)/\sigma(u)\}_{i=1}^{m}$ conditionally on $X_1 > u$. For $x_1$ and $x_2$ such that $1 + \gamma x_i > 0$
for $i = 1, 2$, we find

    $$\sigma(u) f\{u + \sigma(u)x_2 \mid u + \sigma(u)x_1\}
      = \frac{\sigma^2(u) f\{u + \sigma(u)x_1,\ u + \sigma(u)x_2\}/\bar{F}(u)}{\sigma(u) f\{u + \sigma(u)x_1\}/\bar{F}(u)}
      \to -(1 + \gamma x_1)^{1/\gamma + 1}\, V_{12}(x_1, x_2), \quad u \uparrow x^*.$$    (10.40)

Hence by (10.33), the joint density of $\{(X_i - u)/\sigma(u)\}_{i=1}^{m}$ conditionally on $X_1 > u$
in $(x_1, \ldots, x_m)$ such that $x_1 > 0$ and $1 + \gamma x_i > 0$ for $i = 1, \ldots, m$ satisfies

    $$\sigma^m(u) f\{u + \sigma(u)x_1, \ldots, u + \sigma(u)x_m\}/\bar{F}(u)
      \to (1 + \gamma x_1)^{-1/\gamma - 1} \prod_{i=2}^{m} \{-(1 + \gamma x_{i-1})^{1/\gamma + 1}\, V_{12}(x_{i-1}, x_i)\},$$    (10.41)

as $u \uparrow x^*$.

Now let $T$ be a standard Pareto random variable, $P[T > t] = 1/t$ for $1 \le t < \infty$,
and let $\{A_i\}_{i \ge 1}$ be independent, positive random variables, independent of $T$,
and with common marginal distribution

    $$P[A \le a] = \int_{1/(1+a)}^{1} w\,h(w)\,dw = -V_{*1}(1, a), \quad 0 < a < \infty.$$    (10.42)
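Let $\{Y_n\}_{n \ge 1}$ be the Markov chain given by the recursion

    $$Y_1 = \frac{T^\gamma - 1}{\gamma}, \qquad
      Y_n = \frac{(1 + \gamma Y_{n-1})\,A_{n-1}^\gamma - 1}{\gamma}, \quad n \ge 2,$$    (10.43)

or explicitly

    $$Y_n = \frac{\{T \prod_{i=1}^{n-1} A_i\}^\gamma - 1}{\gamma}, \quad n \ge 2.$$    (10.44)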


The random variable $Y_1$ has a GP distribution with shape parameter $\gamma$. For $n \ge 2$
and $(x_{n-1}, x_n)$ such that $1 + \gamma x_i > 0$ ($i = n - 1, n$), the density of $Y_n$ conditionally
on $Y_{n-1} = x_{n-1}$ is, denoting $z_i = (1 + \gamma x_i)^{1/\gamma}$, equal to

    $$\frac{d}{dx_n} P[Y_n \le x_n \mid Y_{n-1} = x_{n-1}]
      = \frac{d}{dx_n} P[A \le z_n/z_{n-1}]$$
    $$= (1 + z_n/z_{n-1})^{-3}\, h\{(1 + z_n/z_{n-1})^{-1}\}\, z_{n-1}^{-1}\, \frac{dz_n}{dx_n}$$
    $$= -V_{*12}(z_{n-1}, z_n)\, z_{n-1}^{2}\, \frac{dz_n}{dx_n}$$
    $$= -(1 + \gamma x_{n-1})^{1/\gamma + 1}\, V_{12}(x_{n-1}, x_n),$$    (10.45)

where we used successively (10.43), (10.42), (10.38), and (10.39).

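Combining (10.41) with (10.45), we obtain that under Conditions 10.19
and 10.20, for every positive integer $m$,

    $$P\left[\left(\frac{X_1 - u}{\sigma(u)}, \ldots, \frac{X_m - u}{\sigma(u)}\right) \in \cdot \,\middle|\, X_1 > u\right]
      \to P[(Y_1, \ldots, Y_m) \in \cdot\,],$$    (10.46)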


as $u \uparrow x^*$. The process $\{Y_n\}$ is called the tail chain of the Markov chain $\{X_n\}$. It
describes the behaviour of the latter when started at a high value $X_1 > u$. Recall
that the tail chain is completely determined by the extreme value index $\gamma$ and the
distribution of $A$; to find the approximate distribution of $(X_1, \ldots, X_m)$ conditional
on $X_1 > u$, we also need the scaling parameter $\sigma(u)$. Finally, observe that (10.40)
and (10.45) yield a convenient interpretation of the distribution of $A$ in that

    $$P\left[\left\{1 + \gamma\,\frac{X_2 - u}{\sigma(u)}\right\}^{1/\gamma} \le a \,\middle|\, X_1 = u\right]
      \to P[A \le a], \quad u \uparrow x^*.$$    (10.47)
Example 10.21 A popular parametric model for $V_*$ is the logistic model

    $$V_*(z_1, z_2) = (z_1^{-1/\alpha} + z_2^{-1/\alpha})^{\alpha}, \quad 0 < z_j < \infty \ (j = 1, 2),$$

with parameter $0 < \alpha \le 1$, see (9.6). The case $\alpha = 1$ corresponds to independent
margins, in which case the spectral measure $H$ puts unit mass on 0 and 1, violating
Condition 10.19. If $0 < \alpha < 1$, however, direct computation reveals that

    $$P[A \le a] = -V_{*1}(1, a) = (1 + a^{-1/\alpha})^{\alpha - 1}, \quad 0 < a < \infty.$$
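
Because the logistic transition distribution has a closed-form inverse, the $A_i$ can be drawn by inversion. The sketch below assumes $0 < \alpha < 1$ and works on the log scale to avoid overflow when $\alpha$ is close to 1; the function names are hypothetical.

```python
import math
import random


def logistic_transition_cdf(a, alpha):
    """P[A <= a] = (1 + a**(-1/alpha))**(alpha - 1) for the logistic model."""
    return (1.0 + a ** (-1.0 / alpha)) ** (alpha - 1.0)


def sample_logistic_transition(alpha, rng):
    """Draw A by inversion: if V is uniform on (0, 1), then
    A = (V**(-1/(1 - alpha)) - 1)**(-alpha) has the distribution above."""
    v = rng.random()
    while v == 0.0:                      # avoid log(0)
        v = rng.random()
    s = -math.log(v) / (1.0 - alpha)
    # log(exp(s) - 1), using log(expm1(s)) ~ s for large s to avoid overflow
    log_t = s if s > 30.0 else math.log(math.expm1(s))
    return math.exp(-alpha * log_t)


rng = random.Random(0)
draws = [sample_logistic_transition(0.5, rng) for _ in range(100000)]
frac_below_one = sum(a <= 1.0 for a in draws) / len(draws)
```

As a check, the empirical fraction of draws below 1 should be close to `logistic_transition_cdf(1.0, 0.5)`, that is, $2^{-1/2}$.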


Without extra assumptions, we can find the limit behaviour of $Y_n$ as $n \to \infty$.
Observe first that by (10.42) and (10.37), we have

    $$E(A) = \int_0^\infty P[A > a]\,da = \int_0^\infty \int_0^{1/(1+a)} w\,h(w)\,dw\,da
      = \int_0^1 \left(\frac{1}{w} - 1\right) w\,h(w)\,dw = 1.$$

By Jensen's inequality, $-\infty \le E\{\log(A)\} < 0$. Therefore, if $A_1, A_2, \ldots$ are
independent copies of $A$, then by the law of large numbers $\sum_{i=1}^{n} \log(A_i) \to -\infty$ and
thus $\prod_{i=1}^{n} A_i \to 0$ as $n \to \infty$. We obtain that

    $$\lim_{n \to \infty} Y_n = \begin{cases} -1/\gamma & \text{if } \gamma > 0, \\ -\infty & \text{if } \gamma \le 0, \end{cases}$$    (10.48)

with probability one. In particular, only a finite number of the $Y_n$ are positive.
The interpretation is that clusters of exceedances over a high threshold necessarily
remain of finite length.

As mentioned before, Conditions 10.19 and 10.20 are not really necessary. A 
more general theory, formulated directly in terms of the transition kernel of the 
chain, is developed in Perfekt (1994). The main conclusions of this section remain 
valid in the more general framework: the representation (10.42) of the distribution 
of $A$ in terms of $V_*$, the representation (10.43) of the tail chain $\{Y_n\}$, and the
limit distribution (10.46). What changes is that the distribution of $A$ need not be
absolutely continuous anymore. In particular, $A$ may have a point mass at zero,
in which case an absorbing state for the tail chain is $-1/\gamma$ if $\gamma > 0$ and $-\infty$
if $\gamma \le 0$. Also, it can happen that $P[A = 1] = 1$, corresponding to asymptotic
complete dependence of the distribution of $(X_1, X_2)$ (section 8.3.2), in which case
$Y_n = Y_1$ for all $n \ge 1$, violating (10.48).


10.4.2 Extremal index 

Suppose as in section 10.4.1 that $\{X_n\}$ is a stationary Markov chain with tail chain
$\{Y_n\}$ satisfying (10.46). We want to express the extremal index $\theta$ of the Markov
chain, provided it exists, in terms of the tail chain. This will allow us at a later
stage to estimate the extremal index when we have estimated the tail chain from
the data.

Recall our notation $M_{j,k} = \max\{X_{j+1}, \ldots, X_k\}$ (with $\max \emptyset = -\infty$) and
$M_k = M_{0,k}$ for integers $0 \le j \le k$. In section 10.2.3, we saw that under suitable
assumptions, the extremal index $\theta$ is the limit of $\theta_n^R(u_n) = P[M_{1,r_n} \le u_n \mid X_1 > u_n]$. We
shall find now that the limit of $\theta_n^R(u_n)$ is determined by the tail chain. Throughout,
we assume Condition 10.8.


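For $m = 2, 3, \ldots$, we have

    $$\left|\theta_n^R(u_n) - P\left[\max_{i \ge 2} Y_i \le 0\right]\right|
      \le P[M_{m,r_n} > u_n \mid X_1 > u_n] + P\left[\max_{i > m} Y_i > 0\right]
      + \left|P[M_{1,m} \le u_n \mid X_1 > u_n] - P\left[\max_{2 \le i \le m} Y_i \le 0\right]\right|.$$    (10.49)

By (10.46), the last term on the right converges to zero as $n \to \infty$. Hence,

    $$\limsup_{n \to \infty} \left|\theta_n^R(u_n) - P\left[\max_{i \ge 2} Y_i \le 0\right]\right|
      \le \limsup_{n \to \infty} P[M_{m,r_n} > u_n \mid X_1 > u_n] + P\left[\max_{i > m} Y_i > 0\right].$$

Since $m$ was arbitrary, we can let $m \to \infty$ to obtain, by (10.8) and (10.48),

    $$\theta = \lim_{n \to \infty} \theta_n^R(u_n) = P\left[\max_{i \ge 2} Y_i \le 0\right].$$    (10.50)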


Observe that $\theta$ is indeed determined solely by the dependence structure in the
chain: by (10.44),

    $$\theta = P\left[\max_{i \ge 1} \prod_{j=1}^{i} A_j \le U\right]$$    (10.51)

(Perfekt 1994), where $U, A_1, A_2, \ldots$ are independent random variables with $U$
uniformly distributed on $(0, 1)$ and the $A_i$ distributed like $A$ in (10.42).
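
Expression (10.51) lends itself to direct Monte Carlo evaluation. The following sketch assumes the logistic model of Example 10.21 with $0 < \alpha < 1$ and truncates each product after `max_steps` factors, which is an approximation justified only because the partial products tend to zero; the function names and default settings are hypothetical.

```python
import math
import random


def sample_logistic_transition(alpha, rng):
    """Draw A with P[A <= a] = (1 + a**(-1/alpha))**(alpha - 1) by inversion."""
    v = rng.random()
    while v == 0.0:
        v = rng.random()
    s = -math.log(v) / (1.0 - alpha)
    log_t = s if s > 30.0 else math.log(math.expm1(s))
    return math.exp(-alpha * log_t)


def extremal_index_mc(alpha, n_sim=5000, max_steps=100, seed=0):
    """Monte Carlo estimate of theta = P[max_i prod_{j<=i} A_j <= U]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        log_u = math.log(1.0 - rng.random())   # log of a uniform on (0, 1]
        log_prod = 0.0
        below = True
        for _ in range(max_steps):
            a = sample_logistic_transition(alpha, rng)
            if a == 0.0:                        # product is zero from here on
                break
            log_prod += math.log(a)
            if log_prod > log_u:                # partial product exceeds U
                below = False
                break
        hits += below
    return hits / n_sim


theta_hat = extremal_index_mc(0.5)
```

Stronger dependence (smaller $\alpha$) should give a smaller estimate. The truncation biases the estimate slightly upwards when $\alpha$ is very small, because the partial products then decay slowly.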


10.4.3 Cluster statistics 

Let $c$ be a cluster functional (Definition 10.13) that is continuous almost
everywhere. All the examples in section 10.3.2 satisfy this requirement. By
Proposition 10.15, the distribution of the cluster statistic $c[\{(X_i - u_n)/\sigma(u_n)\}_{i=1}^{r_n}]$
conditional on $M_{r_n} > u_n$ converges to a limit that can be expressed in terms of the tail
chain $\{Y_i\}$.
Using a similar decomposition as in (10.49), we obtain from (10.8), (10.48)
and (10.20)

    $$P\left[c\left\{\left(\frac{X_i - u_n}{\sigma(u_n)}\right)_{i=1}^{r_n}\right\} \in \cdot \,\middle|\, M_{r_n} > u_n\right]
      \to \theta^{-1}\left\{P[c(Y_1, Y_2, \ldots) \in \cdot\,] - P\left[c(Y_2, Y_3, \ldots) \in \cdot\,,\ \max_{i \ge 2} Y_i > 0\right]\right\}$$    (10.52)

(Yun 2000a). Here we have extended the domain of $c$ to sequences $x_1, x_2, \ldots$ with
only a finite number of positive members by setting $c(x_1, x_2, \ldots) = c(x_1, \ldots, x_r)$
where $r$ is such that $x_i \le 0$ for all $i > r$.


10.4.4 Statistical applications 

In a practical data analysis, we might want to estimate the extremal index, for 
instance, to estimate high return levels as in section 10.2.3, or the distribution of 
a cluster statistic, for instance, the probability that the total amount of rainfall 
during a storm exceeds a high level. If we are willing to assume that the data 
$(x_1, \ldots, x_n)$ are a realization of a sample $(X_1, \ldots, X_n)$ from a stationary Markov
chain satisfying the conditions of the previous sections, then we can use (10.51) 
and (10.52) to solve these problems. 

Consider first the expression (10.51) for the extremal index. Given the bivariate
extreme value distribution $G$, we can compute the distribution of the $A_i$, and then
find $\theta$ in (10.51) by simulation or some other numerical technique. A fast method to
compute the extremal index based on (10.51) that does not rely on direct simulation
from the tail chain, but on the fast Fourier transform, is described in Hooghiemstra
and Meester (1997).

For cluster statistics, we are usually interested in $c[\{X_i - u_n\}_{i=1}^{r_n}]$ without the
normalizing $\sigma(u_n)$. If $c$ is invariant to scale, for example, if it depends only on
$1(X_i > u_n)$, then we can estimate the distribution of the cluster statistic by simulating
the tail chain $\{Y_i\}$ for $1 \le i \le \max\{j \ge 1 : Y_j > 0\}$ according to the definition
(10.43). In practice, we simulate $Y_1, \ldots, Y_r$, with $r$ large enough such that the
probability of a cluster being longer than $r$ is negligible. Alternatively, if the
distribution of the $A_i$ has mass at $\{0\}$, an absorbing state, we can generate $r - 1$
from a geometric distribution with mean $1/P[A = 0]$. Simulating a large number
of realizations of the tail chain allows the limit (10.52) to be approximated by a
Monte Carlo average.
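
As an illustration of this Monte Carlo approach, take $c$ to be the cluster size, the number of positive members of the sequence. Since $Y_1 > 0$ with probability one, writing $N = \#\{i \ge 2 : Y_i > 0\}$, the limit (10.52) reduces to $\pi(k) = \theta^{-1}\{P[N = k - 1] - P[N = k]\}$ for $k \ge 1$, with $\theta = P[N = 0]$ by (10.50). The sketch below assumes the logistic model with $0 < \alpha < 1$ and a fixed simulation length `max_steps`; the sign of $Y_i$ depends only on whether $T \prod_{j<i} A_j > 1$, so $\gamma$ is not needed. Function names are hypothetical.

```python
import math
import random
from collections import Counter


def sample_logistic_transition(alpha, rng):
    """Draw A with P[A <= a] = (1 + a**(-1/alpha))**(alpha - 1) by inversion."""
    v = rng.random()
    while v == 0.0:
        v = rng.random()
    s = -math.log(v) / (1.0 - alpha)
    log_t = s if s > 30.0 else math.log(math.expm1(s))
    return math.exp(-alpha * log_t)


def cluster_size_distribution(alpha, n_sim=5000, max_steps=100, seed=0):
    """Estimate theta and the cluster-size distribution pi from the tail chain."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sim):
        log_level = rng.expovariate(1.0)        # log T for a standard Pareto T
        n_pos = 0                               # number of i >= 2 with Y_i > 0
        for _ in range(max_steps):
            a = sample_logistic_transition(alpha, rng)
            if a == 0.0:                        # chain absorbed below threshold
                break
            log_level += math.log(a)
            n_pos += log_level > 0.0
        counts[n_pos] += 1
    theta = counts[0] / n_sim
    k_max = max(counts) + 1
    pi = {k: (counts[k - 1] - counts[k]) / (n_sim * theta)
          for k in range(1, k_max + 1)}
    return theta, pi


theta_hat, pi_hat = cluster_size_distribution(0.5)
mean_cluster_size = sum(k * p for k, p in pi_hat.items())
```

By construction the estimated probabilities sum to one and the estimated mean cluster size equals $1/\hat\theta$; individual entries for large $k$ can be slightly negative because of Monte Carlo error.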

In cases where the normalization is needed, we must fix a threshold $u$ and then,
by (10.46), we can approximate the distribution of the cluster statistic conditional
on the cluster maximum exceeding $u$ by

    $$\theta^{-1}\left\{P[c(\sigma Y_1, \sigma Y_2, \ldots) \in \cdot\,] - P\left[c(\sigma Y_2, \sigma Y_3, \ldots) \in \cdot\,,\ \max_{i \ge 2} Y_i > 0\right]\right\},$$

where $\sigma = \sigma(u)$.
A remarkable feature of these applications of the tail chain, which were introduced
by Yun (2000a), is that they require knowledge of only the limiting forward transition
probabilities. The sampling scheme of Smith et al. (1997) works differently: (1)
generate a cluster maximum from the appropriate GP distribution as in (10.23);
(2) generate the part of the cluster following the cluster maximum from the forward
tail chain, rejecting samples that exceed the cluster maximum; (3) generate the part
of the cluster preceding the cluster maximum from the backward tail chain, again
rejecting those that exceed the cluster maximum. The backward tail chain, defined
analogously to the forward tail chain, has transitions $A_i^*$ with distribution function

    $$P[A^* \le a] = \lim_{u \uparrow x^*} P\left[\left\{1 + \gamma\,\frac{X_1 - u}{\sigma(u)}\right\}^{1/\gamma} \le a \,\middle|\, X_2 = u\right] = -V_{*2}(a, 1).$$

Although this scheme is intuitively straightforward, it is clearly less efficient than
Yun's scheme, which requires only the forward tail chain and in which no samples
need to be rejected. On the other hand, a benefit of the Smith et al. (1997) scheme
is that it generates clusters directly, the empirical distribution of which can be used
immediately as an estimate of the cluster distribution. A theoretical justification of
the scheme is provided in Segers (2003b).


10.4.5 Fitting the Markov chain 

It remains to estimate the marginal parameters, $\gamma$ and $\sigma = \sigma(u)$, and the
distribution of the $A_i$ or, equivalently, the function $V_*$ in (10.42). The estimation procedure
basically consists of the censored-likelihood approach (section 9.4.2) as in Ledford
and Tawn (1996), but now adapted to the Markov likelihood (10.33) as in Smith
et al. (1997).

First we define our models for the marginal and joint distribution functions
$F(x)$ and $F(x_1, x_2)$ in the regions $x > u$ and $x_i > u$ ($i = 1, 2$) for a sufficiently
high threshold $u$. Denote $\lambda = \lambda(u) = 1 - F(u)$ and $\sigma = \sigma(u)$. Equation (10.34)
suggests the approximation

    $$F(x) \approx 1 - \lambda\left(1 + \gamma\,\frac{x - u}{\sigma}\right)^{-1/\gamma},$$

while from (10.35), using (10.39) and (10.36),

    $$F(x_1, x_2) \approx 1 - \lambda V\left(\frac{x_1 - u}{\sigma}, \frac{x_2 - u}{\sigma}\right) = 1 - V_*(z_1, z_2),$$    (10.53)

    $$\text{with } z_i = \lambda^{-1}\left(1 + \gamma\,\frac{x_i - u}{\sigma}\right)_+^{1/\gamma}, \quad i = 1, 2.$$    (10.54)

Slightly more accurate would be to use the tail equivalent models (9.67) and (9.68),
but for simplicity we stick to the models above as in Smith et al. (1997).
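As the models above are specified only for observations exceeding the
threshold $u$, we must treat observations below the threshold as being censored at that
threshold. Specifically, the marginal likelihood for a single observation $x$ is set
equal to

    $$f_u(x) = \begin{cases}
      \dfrac{\lambda}{\sigma}\left(1 + \gamma\,\dfrac{x - u}{\sigma}\right)_+^{-1/\gamma - 1} & \text{if } x > u, \\[1ex]
      1 - \lambda & \text{if } x \le u,
    \end{cases}$$

and the joint likelihood of a pair $(x_1, x_2)$ is set equal to

    $$f_u(x_1, x_2) = \begin{cases}
      \dfrac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2)
        = -\dfrac{\partial z_1}{\partial x_1}\dfrac{\partial z_2}{\partial x_2}\, V_{*12}(z_1, z_2) & \text{if } x_1 > u,\ x_2 > u, \\[1ex]
      \dfrac{\partial}{\partial x_1} F(x_1, u)
        = -\dfrac{\partial z_1}{\partial x_1}\, V_{*1}(z_1, \lambda^{-1}) & \text{if } x_1 > u \ge x_2, \\[1ex]
      \dfrac{\partial}{\partial x_2} F(u, x_2)
        = -\dfrac{\partial z_2}{\partial x_2}\, V_{*2}(\lambda^{-1}, z_2) & \text{if } x_1 \le u < x_2, \\[1ex]
      F(u, u) = 1 - V_*(\lambda^{-1}, \lambda^{-1}) & \text{if } x_1 \le u,\ x_2 \le u,
    \end{cases}$$

subscripts on $V_*$ denoting partial derivatives and with $(z_1, z_2)$ as in (10.54). Finally,
the censored likelihood of a sample $(x_1, \ldots, x_n)$ is defined by replacing $f$ with $f_u$
in (10.33).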

Usually we assume that the function $V_*$ belongs to some parametric family,
$V_*(\cdot \mid \theta)$ say, and estimate the unknown parameters $(\gamma, \sigma, \theta)$ by maximizing the
censored likelihood; $\lambda$ can be set equal to the ratio of the number of exceedances
to $n$. Four such models for $V_*$ are listed below; see section 9.2 for a more extensive
list. Once we have estimated the model, we can implement the simulation
schemes of the previous section to obtain estimates of the extremal index and
properties of cluster statistics. Confidence intervals can be obtained by bootstrapping
the observed Markov chain according to the scheme described in section 10.3.4
and refitting the model to each sequence. A cruder alternative
could be to resample the maximum-likelihood parameter estimates from their
estimated asymptotic multivariate normal distribution, assuming the usual properties
of maximum-likelihood estimators hold.
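
To make the censored likelihood concrete, the following sketch evaluates its logarithm for the logistic model $V_*(z_1, z_2) = (z_1^{-1/\alpha} + z_2^{-1/\alpha})^\alpha$ with $0 < \alpha < 1$. It is an illustration under stated assumptions rather than the fitting code of Smith et al. (1997): it requires $\gamma \ne 0$, uses the symmetry of the logistic model to obtain $V_{*2}$ from $V_{*1}$, and the function name and example data are hypothetical. Maximization over the parameters is left to a numerical optimizer.

```python
import math


def censored_loglik(x, u, lam, sigma, gamma, alpha):
    """Log of the censored Markov likelihood: (10.33) with f replaced by f_u."""
    def z_and_dz(v):
        # z from (10.54) and its derivative dz/dx, for an observation v > u
        t = 1.0 + gamma * (v - u) / sigma
        if t <= 0.0:
            raise ValueError("observation outside the support of the GP tail")
        return t ** (1.0 / gamma) / lam, t ** (1.0 / gamma - 1.0) / (lam * sigma)

    def s(z1, z2):
        return z1 ** (-1.0 / alpha) + z2 ** (-1.0 / alpha)

    def v1(z1, z2):   # partial derivative of V_* with respect to its first argument
        return -z1 ** (-1.0 / alpha - 1.0) * s(z1, z2) ** (alpha - 1.0)

    def v12(z1, z2):  # mixed second partial derivative of V_*
        return (-(1.0 - alpha) / alpha * (z1 * z2) ** (-1.0 / alpha - 1.0)
                * s(z1, z2) ** (alpha - 2.0))

    z_u = 1.0 / lam   # value of z at the threshold

    def log_marginal(v):
        if v <= u:
            return math.log(1.0 - lam)
        z, dz = z_and_dz(v)
        return math.log(dz / z ** 2)

    def log_joint(a, b):
        if a > u and b > u:
            (z1, d1), (z2, d2) = z_and_dz(a), z_and_dz(b)
            return math.log(-d1 * d2 * v12(z1, z2))
        if a > u:
            z1, d1 = z_and_dz(a)
            return math.log(-d1 * v1(z1, z_u))
        if b > u:
            z2, d2 = z_and_dz(b)
            return math.log(-d2 * v1(z2, z_u))   # V_*2(z_u, z2) by symmetry
        return math.log(1.0 - s(z_u, z_u) ** alpha)

    total = sum(log_joint(a, b) for a, b in zip(x[:-1], x[1:]))
    total -= sum(log_marginal(v) for v in x[1:-1])
    return total


# Hypothetical run of consecutive exceedances of u = 2
cluster = [3.0, 3.1, 3.2, 3.1, 3.0]
ll_strong = censored_loglik(cluster, u=2.0, lam=0.1, sigma=1.0, gamma=0.1, alpha=0.3)
ll_weak = censored_loglik(cluster, u=2.0, lam=0.1, sigma=1.0, gamma=0.1, alpha=0.95)
```

For a long run of similar exceedances such as this one, the log-likelihood should favour the stronger dependence parameter.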


Parametric models 

For easy reference, we repeat here a couple of parametric models for $V_*$ together
with the corresponding distribution for $A$ as in (10.42).

Asymmetric logistic model (Tawn 1988a,b)

    $$V_*(z_1, z_2) = (1 - \psi_1) z_1^{-1} + (1 - \psi_2) z_2^{-1} + \{(\psi_1/z_1)^{1/\alpha} + (\psi_2/z_2)^{1/\alpha}\}^{\alpha}$$

for $0 \le \psi_i \le 1$ ($i = 1, 2$) and $0 < \alpha \le 1$, see (9.7). The logistic model arises as a
special case, if $\psi_1 = \psi_2 = 1$. If $0 < \alpha < 1$, the associated transition distribution
has $P[A = 0] = 1 - \psi_1$ and

    $$P[A \le a] = 1 - \psi_1 + \psi_1^{1/\alpha}\{\psi_1^{1/\alpha} + (\psi_2/a)^{1/\alpha}\}^{\alpha - 1}, \quad a > 0.$$

In case $\alpha = 1$, we have $P[A = 0] = 1$ regardless of the $\psi_i$.

Asymmetric negative logistic model (Joe 1990)

    $$V_*(z_1, z_2) = z_1^{-1} + z_2^{-1} - \{(z_1/\psi_1)^{r} + (z_2/\psi_2)^{r}\}^{-1/r}$$

for $0 \le \psi_i \le 1$ ($i = 1, 2$) and $r > 0$, see (9.13) where $\alpha = -1/r$. The associated
transition distribution has $P[A = 0] = 1 - \psi_1$ and

    $$P[A \le a] = 1 - \psi_1^{-r}(\psi_1^{-r} + \psi_2^{-r} a^{r})^{-1/r - 1}, \quad a > 0.$$

In the limiting case $r = 0$, again $P[A = 0] = 1$.

Bilogistic model (Smith 1990b)

    $$V_*(z_1, z_2) = z_1^{-1} q^{1 - \alpha} + z_2^{-1} (1 - q)^{1 - \beta}$$

for $0 < \alpha < 1$, $0 < \beta < 1$, and where $q = q(z_1, z_2)$ solves

    $$(1 - \alpha) z_1^{-1} (1 - q)^{\beta} = (1 - \beta) z_2^{-1} q^{\alpha},$$    (10.55)

see (9.9). The associated transition distribution is

    $$P[A \le a] = q^{1 - \alpha}, \quad a > 0,$$

where $q$ solves (10.55) when $z_1 = 1$ and $z_2 = a$.

Negative bilogistic model (Coles and Tawn 1994)

    $$V_*(z_1, z_2) = z_1^{-1} + z_2^{-1} - \{z_1^{-1} q^{1 + \alpha} + z_2^{-1} (1 - q)^{1 + \beta}\}$$

for $\alpha > 0$, $\beta > 0$, and where $q$ solves

    $$(1 + \alpha) z_1^{-1} q^{\alpha} = (1 + \beta) z_2^{-1} (1 - q)^{\beta}.$$    (10.56)

The associated transition distribution is

    $$P[A \le a] = 1 - q^{1 + \alpha}, \quad a > 0,$$

where $q$ solves (10.56) when $z_1 = 1$ and $z_2 = a$.

Symmetric models are obtained from the first two models when $\psi_1 = \psi_2$ or
from the last two models when $\alpha = \beta$.
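
For the bilogistic model, the transition distribution is only defined implicitly through (10.55), but the root is easy to find numerically. The sketch below uses plain bisection on $(0, 1)$, which is one possible choice and not necessarily what the cited authors used; the function names are hypothetical. It also checks the known special case $\alpha = \beta$, in which (10.55) can be solved explicitly and the result coincides with the logistic formula of Example 10.21.

```python
def bilogistic_transition_cdf(a, alpha, beta, tol=1e-12):
    """P[A <= a] = q**(1 - alpha), where q in (0, 1) solves (10.55)
    with z1 = 1 and z2 = a:  (1 - alpha)(1 - q)**beta = (1 - beta) q**alpha / a."""
    def g(q):
        # Decreasing in q: positive at q = 0 and negative at q = 1
        return (1.0 - alpha) * (1.0 - q) ** beta - (1.0 - beta) * q ** alpha / a

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (0.5 * (lo + hi)) ** (1.0 - alpha)


def logistic_transition_cdf(a, alpha):
    """P[A <= a] for the logistic model, for comparison."""
    return (1.0 + a ** (-1.0 / alpha)) ** (alpha - 1.0)


values = [bilogistic_transition_cdf(a, 0.4, 0.4) for a in (0.2, 1.0, 5.0)]
reference = [logistic_transition_cdf(a, 0.4) for a in (0.2, 1.0, 5.0)]
```

The same bisection applies to the negative bilogistic model after replacing the equation by (10.56) and the return value by $1 - q^{1+\alpha}$.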
10.4.6 Additional topics

Threshold dependence

Model (10.53) for the Markov chain assumes that the dependence between
consecutive exceedances of a high threshold does not change as the threshold is increased.
This is acceptable if we really are interested in the asymptotic properties of the
process. Typically, however, we are interested in high, but finite, levels at which
the process may behave very differently. For example, if the joint distribution of
$(X_1, X_2)$ is in the domain of attraction of an extreme value distribution with
independent margins, that is, $X_1$ and $X_2$ are asymptotically independent, then $\theta = 1$ and
there is no clustering in the limit. Clustering may occur at finite levels, however,
and inferences such as return-level estimation can be improved, if we recognize
that $\theta(u) < 1$. The asymptotically dependent model (10.53) is particularly
inadequate in this situation because $\theta = 1$ can be achieved only if $X_1$ and $X_2$ are
completely independent. In this section, we obtain threshold-dependent estimates
of the extremal index and cluster statistics by extending the model (10.53) and using
a penultimate approximation to the tail chain (10.43); see Bortot and Tawn (1998).

The model for the distribution of $(X_1, X_2)$ in the joint-tail region $x_i > u$ ($i = 1, 2$)
is taken from Ledford and Tawn (1997); see also section 9.5. Specifically,

    $$F(x_1, x_2) = P[X_1 \le x_1, X_2 \le x_2]
      \approx (1 - z_1^{-1}) + (1 - z_2^{-1}) - 1 + \mathcal{L}(z_1, z_2)\, z_1^{-c_1} z_2^{-c_2},$$    (10.57)

where $z_i \approx 1/\bar{F}(x_i)$ is the transformation (10.54), $\mathcal{L}$ is a bivariate slowly varying
function, and $c_1$ and $c_2$ are positive parameters satisfying $c_1 + c_2 \ge 1$. The
coefficient of tail dependence, $\eta$, defined by the limit

    $$\lim_{t \to \infty} \bar{F}(tx, tx)/\bar{F}(t, t) = x^{-1/\eta}, \quad 0 < x < \infty,$$    (10.58)

with $\bar{F}$ here denoting the joint survival function on the $z$ scale,
is $\eta = 1/(c_1 + c_2)$. If $c_1 + c_2 > 1$ then $\eta < 1$ and thus $P[X_2 > x \mid X_1 > x] \to 0$
as $x \uparrow x^*$, that is, the pair $(X_1, X_2)$ is asymptotically independent. In that case,
we obtain $P[A = 0] = 1$ in (10.42), and the extremal index (10.51) is equal to
unity, that is, there is no clustering in the limit.

Estimation proceeds with the censored likelihood of section 10.4.5 adapted to
the new model, a possible parametric form for $\mathcal{L}$ being

    $$\mathcal{L}(z_1, z_2) = a_0 + (z_1 z_2)^{-1/2}\{z_1 + z_2 - z_1 z_2 V_*(z_1, z_2)\},$$    (10.59)

with $a_0 \ge 0$ and where $V_*$ is one of the parametric models listed in section 10.4.5.
The special case $c_1 = c_2 = 1/2$ and $a_0 = 0$ leads back to the previous
model (10.53).

Suppose now that we want to find the extremal index or the distribution of
a cluster statistic at some finite threshold $u_1 > u$. We can still use the tail-chain
approximation (10.46), replacing $u$ with $u_1$, and where $\{Y_n\}$ are defined by (10.43).
However, instead of simulating the $A_i$ from their degenerate limit distribution, we
use (10.47) to simulate from the penultimate form




\ x 2 - H l/y 


Fa(ci ； v) = P 



X\ = v 


1 - , X- 1 ) - X- 1 CdaX- 1 , X.- 1 )}, 


with $\lambda = \bar{F}(v)$. Since this distribution depends on the particular value of the
conditioning variable, $v$, the $A_i$ are no longer identically distributed: given $Y_i$, we
simulate $A_i$ from $F_A\{\cdot\,; u_1 + \sigma(u_1)Y_i\}$. The tail chain can be simulated either for a
fixed time $r$, as in section 10.4.4, or stopped when $X_i = u_1 + \sigma(u_1)Y_i$ falls below
$u$, at which point the justification for the model is lost.


Non-parametric estimation 


It is not necessary to fit a bivariate parametric model to obtain the distribution
of the transitions $A_i$. The transitions satisfy

    A_i = \{ (1 + \gamma Y_{i+1}) / (1 + \gamma Y_i) \}^{1/\gamma},    i = 1, 2, \ldots,

where $Y_i$ approximates $(X_i - u)/\sigma(u)$ when $X_1 > u$. In the special case
that the $X_i$ are standard exponentially distributed, we have $\gamma = 0$,
$\sigma(u) = 1$, and $A_i = \exp(X_{i+1} - X_i)$. For data $\{x_j\}_{1 \le j \le n}$,
therefore, we can define the empirical values of $A_i$ to be

    \{ \exp(x^*_{j+i} - x^*_{j+i-1}) : x_j > u, 1 \le j \le n - i \},    (10.60)

where $x^*_j$ are the data transformed to standard exponential margins, for instance,
by the empirical distribution function. The transition distribution can be estimated
with a kernel density estimator based on these empirical values (Bortot and Coles
2000). Such an estimate also provides a method for assessing the fit of parametric
models.
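
As an illustration, the empirical transitions (10.60) could be computed along the
following lines. This is a minimal Python sketch and not the code behind the
analysis reported here: the function names are hypothetical, and the rank-based
transform $F_n(x_j) = \mathrm{rank}_j/(n + 1)$ is an assumed variant of the empirical
distribution function, chosen so that the logarithm is never evaluated at zero.

```python
import math


def to_exponential_margins(x):
    """Transform data to approximately standard exponential margins.

    Assumption: F_n(x_j) = rank_j / (n + 1), so that 1 - F_n > 0 everywhere.
    """
    n = len(x)
    order = sorted(range(n), key=lambda j: x[j])
    transformed = [0.0] * n
    for rank, j in enumerate(order, start=1):
        transformed[j] = -math.log(1.0 - rank / (n + 1.0))
    return transformed


def empirical_transitions(x, u, i=1):
    """Empirical values of A_i as in (10.60).

    For every j with x_j > u (threshold on the original scale), return
    exp(x*_{j+i} - x*_{j+i-1}), where x* are the transformed data.
    """
    xs = to_exponential_margins(x)
    n = len(x)
    # 0-based j runs over 0, ..., n - i - 1, i.e. 1 <= j <= n - i in the text.
    return [math.exp(xs[j + i] - xs[j + i - 1]) for j in range(n - i) if x[j] > u]
```

The resulting values can then be passed to any kernel density estimator; which
estimator and bandwidth to use is left open here.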


Higher-order Markov chains 

Extremes of $d$th-order Markov chains, $d > 1$, were considered in Yun (1998, 2000a).
The ideas remain the same, but the appropriate higher-order transition probabilities
lead to a tail chain that also has order $d$. Statistical modelling requires a
$(d + 1)$-variate extreme value distribution, suitably restricted to ensure stationarity and
fitted with the appropriate extension of the likelihood in section 10.4.5. To select 
between models of different order, it is advantageous for the lower-order model to 
be nested within the higher-order model. In this case, the models can be compared 
by evaluating both of them for the higher-order likelihood: the form of the censored 
likelihood means that likelihoods of different orders are not necessarily comparable. 
10.4.7 Data example

In this section, we fit first-order Markov models to the Uccle data of section 10.2.2, 
consider the issue of asymptotic independence and compare the simulated cluster 
characteristics to the empirical estimates of section 10.3.5. 

We fit Markov chains with the six asymptotic dependence structures listed in
Table 10.1 at thresholds ranging from the 90% to the 99.5% empirical quantile. As
with the compound Poisson models of section 10.3.5, parameter estimates are stable
above the 96% threshold, and constraining the asymmetric logistic and asymmetric
negative logistic models to $\psi_2 = 1$ causes almost no change in the maximum
likelihood. The model fits at the 96% threshold are summarized in Table 10.1.

Symmetry corresponds to the hypothesis $\alpha = \beta$ in the case of the bilogistic and
negative bilogistic models. Under this hypothesis, the models reduce to the logistic
and negative logistic, and a likelihood-ratio test gives no indication of asymmetry.
Note that we assume here and elsewhere that standard likelihood properties hold
even though the censored likelihood is an approximation to the joint density.
Simulating test statistics under the null hypothesis is an alternative, but computationally
expensive, approach. In the case of the asymmetric logistic and asymmetric
negative logistic models with $\psi_2 = 1$, symmetry corresponds to the boundary value
$\psi_1 = 1$. This is one example of the nonregular problems encountered in
multivariate extremes (Tawn 1988a, 1990); the likelihood-ratio statistic should be compared
to a one-half chi-squared distribution with one degree of freedom. For the
asymmetric logistic model, the statistic is 2.12 with p-value $P[\chi^2_1 > 2.12]/2 = 0.073$,


Table 10.1 Parameter estimates, standard errors, negative log-likelihoods
and extremal indices for six asymptotically dependent Markov models. The
asymmetric logistic and asymmetric negative logistic models are constrained
to $\psi_2 = 1$, with as special cases for $\psi_1 = 1$ the logistic and negative logistic
models, respectively.

Model                          $\sigma$      $\gamma$        Dependence                 NLLH     $\theta$

Logistic                       2.8 (0.4)     -0.30 (0.11)    $\alpha$ = 0.67 (0.04)     597.15   0.54

Bilogistic                     2.7 (0.4)     -0.29 (0.11)    $\alpha$ = 0.74 (0.05)     595.89   0.55
                                                             $\beta$ = 0.58 (0.08)

Asymmetric logistic            2.8 (0.4)     -0.30 (0.12)    $\alpha$ = 0.62 (0.06)     596.09   0.56
                                                             $\psi_1$ = 0.76 (0.14)

Negative logistic              2.7 (0.4)     -0.28 (0.11)    $r$ = 0.77 (0.09)          597.63   0.54

Negative bilogistic            2.7 (0.07)    -0.27 (0.04)    $\alpha$ = 0.89 (0.04)     596.71   0.54
                                                             $\beta$ = 1.81 (0.07)

Asymmetric negative logistic   2.7 (0.4)     -0.28 (0.11)    $r$ = 0.92 (0.16)          596.51   0.55
                                                             $\psi_1$ = 0.75 (0.14)
and for the asymmetric negative logistic model, the statistic is 2.24 with p-value
0.067. We conclude that there is only weak evidence for asymmetry and we proceed
with the symmetric logistic model.
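
The two p-values can be checked from the negative log-likelihoods in Table 10.1.
The following Python sketch, with hypothetical function names, uses the identity
$P[\chi^2_1 > x] = \mathrm{erfc}(\sqrt{x/2})$ and halves the result, as appropriate when the
null value lies on the boundary of the parameter space.

```python
import math


def chi2_1_sf(x):
    """Survival function of the chi-squared distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lr_test(nllh_null, nllh_alt):
    """Likelihood-ratio statistic and one-half chi-squared(1) p-value.

    nllh_null: negative log-likelihood of the restricted (symmetric) model.
    nllh_alt:  negative log-likelihood of the unrestricted model.
    """
    stat = 2.0 * (nllh_null - nllh_alt)
    return stat, 0.5 * chi2_1_sf(stat)


# Logistic versus asymmetric logistic, values taken from Table 10.1.
stat_al, p_al = boundary_lr_test(597.15, 596.09)
# Negative logistic versus asymmetric negative logistic.
stat_anl, p_anl = boundary_lr_test(597.63, 596.51)
```

Up to rounding, `stat_al` and `stat_anl` reproduce the statistics 2.12 and 2.24 quoted
above.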

We can assess how well the model fits the data with some diagnostic plots. The
estimated shape parameter is greater than that obtained from the marginal analysis
($\gamma = -0.42$) and the quantile plot for threshold excesses is poor. That there is
little to choose between the models featured in Table 10.1 is exemplified by the
similarity of the estimates of the Pickands dependence function $A$ in Figure 10.9.
Recall from (8.54) that the Pickands dependence function of a bivariate extreme
value distribution is defined by $A(w) = V_*\{(1 - w)^{-1}, w^{-1}\}$ for $0 < w < 1$. In
addition, the parametric estimates are close to the non-parametric one by Capéraà
and Fougères (2000a); see also section 9.4.1.
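
To make the definition concrete, the sketch below evaluates $A(w)$ through $V_*$ for
the logistic model. It is an illustration only: the function names are hypothetical,
and the parametrization $V_*(z_1, z_2) = (z_1^{-1/\alpha} + z_2^{-1/\alpha})^{\alpha}$ with
$0 < \alpha \le 1$ is assumed to match the one used in Table 10.1.

```python
def v_star_logistic(z1, z2, alpha):
    """Logistic V_* (assumed parametrization, 0 < alpha <= 1)."""
    return (z1 ** (-1.0 / alpha) + z2 ** (-1.0 / alpha)) ** alpha


def pickands(w, v_star):
    """Pickands dependence function A(w) = V_*{(1 - w)^(-1), w^(-1)}, 0 < w < 1."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    return v_star(1.0 / (1.0 - w), 1.0 / w)


def pickands_logistic(w, alpha):
    """A(w) for the logistic model; equals {(1 - w)^(1/alpha) + w^(1/alpha)}^alpha."""
    return pickands(w, lambda z1, z2: v_star_logistic(z1, z2, alpha))
```

Under this parametrization, $A(1/2) = 2^{\alpha - 1}$, so that the fitted value
$\alpha = 0.67$ would give $A(1/2) = 2^{-0.33}$, while $\alpha = 1$ gives $A \equiv 1$
(independence).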

We also investigate how closely the data follow the asymptotic tail chain $\{Y_n\}$
of the model by comparing the empirical values (10.60) of the transitions with
their estimated distribution in Figure 10.10. The joint density plot shows that the
empirical values are negatively correlated, so we would need a higher threshold
to find the independence structure of the tail chain. On the other hand, the
discrepancies between the empirical and model marginal distributions are sufficiently
small that an empirical version of the tail chain could be obtained by simulating
independent transitions from the kernel density estimate in Figure 10.10.

Figure 10.9 Estimates of the Pickands dependence function of the bivariate
Markov model: non-parametric, logistic, asymmetric logistic and bilogistic.

Figure 10.10 Diagnostic plots for the tail-chain transitions. The joint density plot
shows the empirical transitions with model contours; the density plot shows the
model estimate and a kernel density estimate; the probability and quantile plots
refer to the transitions $A_1$.

Extremal characteristics of the fitted logistic model are found from 10000
simulations of the model tail chain with length $r = 100$. The extremal index is
0.54 with bootstrapped 95% confidence interval (0.42, 0.69), the mean cluster size
is 1.84 (1.45, 2.37) and the mean number of up-crossings per cluster is 1.09 (1.04,
1.17). The cluster-size distribution is $\pi(1) = 0.60$, $\pi(2) = 0.20$, $\pi(3) = 0.10$ and
$\pi(4) = 0.05$. Figure 10.11 exhibits the estimate of the distribution of the aggregate
cluster excess that deviates from the empirical estimate mainly around 1°C-4°C.
The Markov model produces clusters that are smaller than, but in general agreement 
with, those found empirically at the same threshold. The choice of parametric model 
in fact has little influence on the extremal characteristics: witness the extremal 
indices from all six models displayed in Table 10.1. 

The estimates of the 100, 1000 and 10000 July return levels with bootstrapped
95% confidence intervals are 37.6 (36.3, 39.0), 38.9 (37.0, 41.9) and 39.6 (37.2,
44.2); the estimated upper end-point is 40.3 (37.2, 53.9). These are larger than
the estimates from the marginal analysis in section 10.2.2, due principally to the
different shape parameters, and suggest a deficiency in this Markov model. This
is in line with findings of Dupuis and Tawn (2001) that misspecification of the
dependence model may corrupt estimates of marginal parameters.

Figure 10.11 Empirical distribution function and estimate from the Markov model
of cluster excess (log scale) at the 96% threshold.

We have noted some evidence for asymptotic independence, such as the empirical
estimates of the extremal index in Figure 10.5 that increase at high thresholds.
To assess this evidence more formally, we test $\eta = 1$, where $\eta$ is the
coefficient of tail dependence (10.58); see also section 9.5. First we transform the
data $X_1, \ldots, X_n$ to approximate standard Pareto margins by $Z_i = 1/\{1 - F_n(X_i)\}$,
where $F_n$ is the empirical distribution function; an alternative is to transform to
standard Fréchet margins. Next, define

    T_i = \min(Z_i, Z_{i+1})    for i \in \{ j : X_j and X_{j+1} fall in the same year \}.

In view of (10.58), the tail function of $T_i$ is regularly varying with index $-1/\eta$.
Hence, if $T_{(1)} \ge T_{(2)} \ge \cdots$ are the $T_i$ in descending order, then Hill's
estimator for $\eta$ is

    \hat\eta_k = \frac{1}{k} \sum_{i=1}^{k} \log T_{(i)} - \log T_{(k+1)};

see, for instance, Ledford and Tawn (2003). Values of $\hat\eta_k$ for different $k$ are
reproduced in Figure 10.12, with bootstrapped 95% confidence intervals constructed by
resampling the data blocked by year. The estimates are about 0.8 and are
significantly less than 1 for all values of $k$. There is some evidence, therefore, that the
series is asymptotically independent and we should be wary of extrapolating the
results obtained from the previous Markov-chain model.
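
The steps just described can be sketched in Python as follows. The function names
are hypothetical, and $F_n$ is taken to be $\mathrm{rank}/(n + 1)$, an assumed rescaling of
the empirical distribution function that keeps $1 - F_n$ strictly positive; the
bootstrap by year is not shown.

```python
import math


def pareto_margins(x):
    """Z_i = 1 / {1 - F_n(X_i)} with the assumed choice F_n = rank / (n + 1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = 1.0 / (1.0 - rank / (n + 1.0))
    return z


def hill_eta(t, k):
    """Hill's estimator: mean of the k largest log T minus log of the (k+1)th largest."""
    s = sorted(t, reverse=True)
    if not 1 <= k < len(s):
        raise ValueError("k must satisfy 1 <= k < len(t)")
    return sum(math.log(v) for v in s[:k]) / k - math.log(s[k])


def tail_dependence_estimate(x, year, k):
    """Estimate eta from T_i = min(Z_i, Z_{i+1}), keeping only within-year pairs."""
    z = pareto_margins(x)
    t = [min(z[i], z[i + 1]) for i in range(len(x) - 1) if year[i] == year[i + 1]]
    return hill_eta(t, k)
```

Plotting `tail_dependence_estimate(x, year, k)` against `k` would give a display of
the kind shown in Figure 10.12.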
Figure 10.12 Hill's estimates of the coefficient of tail dependence, $\eta$,
against the number of order statistics, with bootstrapped 95% confidence intervals.


Asymptotic independence can be handled by model (10.57), which supports cluster
characteristics that can change with threshold. We choose again the symmetric
logistic model for $V_*$ in (10.59) and find that parameter estimates are stable above
the 96% threshold, although $a_0$ is poorly estimated. At the 96% threshold,
$\hat\eta = 0.84$ and the p-value for the nonregular, likelihood-ratio test of $\eta = 1$
(Bortot and Tawn 1998) is 0.03, confirming our earlier conclusion of asymptotic
independence. A likelihood-ratio test does not reject $c_1 = c_2$ so we refit the model
with this constraint, obtaining $\hat\sigma = 2.7$ (0.4), $\hat\gamma = -0.35$ (0.10),
$\hat c_1 = \hat c_2 = 0.59$ (0.07), $\hat a_0 = 0.2$ (0.3) and $\hat\alpha = 0.53$ (0.10).

The estimates of the extremal index from this model, obtained by simulating
tail chains of length $r = 20$ and truncating once the chain falls below the model
threshold, are reproduced in Figure 10.13. Other cluster characteristics were
simulated too: the mean cluster size decreased from 1.73 at the 96% threshold to 1.47
by the 99.5% threshold; the mean number of up-crossings per cluster rose from
1.00 to 1.06; and $\pi(1)$ increased from 0.60 to 0.69, which is consistent with the
empirical estimates in Figure 10.8.

When the extremal index changes with threshold, return-level estimation is
improved if the approximation $P[M_n \le x] \approx \{F(x)\}^{n\theta}$ is used with
$\theta = \theta(x)$. The return levels obtained in this way from our model are
37.1 (36.2, 38.0), 38.1 (36.8, 40.0) and 38.5 (37.0, 41.4), with upper end-point
38.8 (37.1, 44.7). These are close to the return levels estimated from the GEV
model, principally because of the similar shape parameters.
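
A return level under this approximation solves $\{F(x)\}^{n\theta(x)} = 1 - 1/T$, which
in general has no closed form once $\theta$ depends on $x$. The Python sketch below
solves it by bisection. It is illustrative only: the function names are hypothetical,
the tail is assumed to follow the GP form $1 - F(x) = \lambda\{1 + \gamma(x - u)/\sigma\}^{-1/\gamma}$
with $\gamma < 0$, and the left-hand side is assumed to be increasing in $x$ between the
threshold and the upper end-point.

```python
import math


def gp_survivor(x, u, sigma, gamma, lam):
    """Assumed tail model: P[X > x] = lam * {1 + gamma (x - u) / sigma}^(-1/gamma), x >= u."""
    base = 1.0 + gamma * (x - u) / sigma
    if base <= 0.0:
        return 0.0  # at or beyond the upper end-point when gamma < 0
    return lam * base ** (-1.0 / gamma)


def return_level(T, n, theta, u, sigma, gamma, lam, tol=1e-10):
    """Solve {F(x)}^(n * theta(x)) = 1 - 1/T for x by bisection.

    theta is a callable giving the extremal index at level x; gamma < 0 is
    assumed, so that the search interval [u, u - sigma / gamma] is finite.
    """
    target = math.log(1.0 - 1.0 / T)

    def g(x):
        prob = 1.0 - gp_survivor(x, u, sigma, gamma, lam)
        return n * theta(x) * math.log(prob) - target

    lo, hi = u, u - sigma / gamma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a constant extremal index, `theta=lambda x: 0.5` say, the result agrees with the
closed-form GP quantile, which provides a simple check of the routine.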

This concludes our analysis of the Uccle data. We have found evidence for
asymptotic independence, which means that cluster characteristics change with
threshold. Within the data, the empirical estimates of section 10.3.5 provide a
valuable description, but if inference is required for levels at which we have no
data then the asymptotically independent Markov model of this section can be used.

Figure 10.13 Extremal index estimates against threshold on complementary
log-log scale: empirical, with bootstrapped 95% confidence intervals, and from
the asymptotically independent Markov model.

Return levels from the different models are summarized in Table 10.2. Of the
marginal models, we prefer the GP model for threshold exceedances to the GEV
model for block maxima because the estimates are more precise. In section 10.3.5,
the GP return levels were estimated with $\theta = 0.49$. In light of asymptotic
independence, we should use $\theta = 1$, which yields estimates that are closer to the GEV
estimates. The asymptotically dependent model is inconsistent with the other results
because of its larger shape parameter. The asymptotically independent model,
however, produces estimates similar to the GEV estimates and with similar confidence
intervals. We can conclude with some confidence, therefore, that the point estimates
from the GEV model are good estimates of the true July return levels.

Table 10.2 Return levels (°C) with 95% confidence intervals and shape
parameters from five models: GP with $\theta = 0.49$, GP; GP with $\theta = 1$, GP1;
GEV; asymptotically independent Markov chain, MCI; asymptotically
dependent Markov chain, MCD.

Model   100                 1000                10000               $\gamma$

GP      36.5 (35.7, 36.9)   37.2 (36.2, 38.1)   37.5 (36.3, 38.7)   -0.42
GP1     36.8 (35.9, 37.2)   37.3 (36.3, 38.3)   37.5 (36.4, 38.9)   -0.42
GEV     36.9 (36.2, 38.6)   37.9 (36.9, 40.5)   38.3 (37.2, 41.8)   -0.34
MCI     37.1 (36.2, 38.0)   38.1 (36.8, 40.0)   38.5 (37.0, 41.4)   -0.35
MCD     37.6 (36.3, 39.0)   38.9 (37.0, 41.9)   39.6 (37.2, 44.2)   -0.30
10.5 Multivariate Stationary Processes

Up to now the setting of this chapter consisted of a univariate stationary time series. 
Complementarily, the framework of Chapters 8 and 9 was that of independent 
multivariate observations. In this section, we join both lines to the study of extremes 
of multivariate stationary time series. Although this area is relatively unexplored, 
some theory is already available, mainly on the vector of component-wise maxima. 
In particular, we shall encounter an appropriate generalization of the extremal limit 
theorem (ELT) in section 10.5.1 and of the extremal index in section 10.5.2. These 
results, however, have so far hardly led to any practical statistical procedures. It is 
our hope, therefore, that the present overview of the theory might stimulate further 
research in the area. 

10.5.1 The extremal limit theorem 

Let $X_n = (X_{n,1}, \ldots, X_{n,d})$, $n \ge 1$, be a stationary sequence of random vectors in
$\mathbb{R}^d$ with distribution function $F$. We seek to model the extremes of the process. A
natural starting point is the sample maximum, defined as the vector of
component-wise maxima,

    M_n = ( \max_{i=1,\ldots,n} X_{i,1}, \ldots, \max_{i=1,\ldots,n} X_{i,d} ).

We shall investigate the asymptotic distribution of $a_n^{-1}(M_n - b_n)$, where
$a_n > 0 = (0, \ldots, 0)$ and $b_n$ are $d$-dimensional vectors. By convention, operations on
and relations between such vectors are to be read component-wise.

The case of independent vectors $X_n$ was treated in Chapter 8. A central problem
there was to characterize the class of distribution functions $G$ with non-degenerate
margins that can arise as the limit in

    P[a_n^{-1}(M_n - b_n) \le x] \to G(x),    n \to \infty.    (10.61)

This gave rise to the class of multivariate extreme value distributions that were
described in detail. In the stationary case now, we shall seek conditions so that
any limit distribution $G$ in (10.61) must be a $d$-variate extreme value distribution
as well. This will provide a proper generalization of the univariate ELT
(Theorem 10.2). As in the univariate case, the long-range dependence in the process
will need to be restricted in some way.

At this stage it pays off to reflect a little on the structure of the arguments in the
univariate case. Let $\{X_n\}$ be a stationary sequence of univariate random variables
and recall the notation of section 10.2. For a sequence of thresholds $u_n$ consider
the events $A_{n,i} = \{X_i \le u_n\}$. Observe that for fixed $n$ the sequence of indicator
variables $\{1(A_{n,i})\}_{i \ge 1}$ is stationary.

The crucial step in the proof of Theorem 10.2 is the decomposition (10.3),
$P[M_n \le u_n] = \{P[M_{r_n} \le u_n]\}^{\lfloor n/r_n \rfloor} + o(1)$, for a positive integer
sequence $r_n$ tending to infinity but at a slower rate than $n$. It is a useful exercise
to rewrite the whole argument leading to (10.3) in terms of the events $A_{n,i}$.
Explicitly, for a set $I$ of positive integers we can write

    P[M(I) \le u_n] = P[ \bigcap_{i \in I} A_{n,i} ].

The $D(u_n)$ condition required in the theorem can be expressed in terms of the
events $A_{n,i}$ as well since

    \alpha(n, s) = \max_{1 \le l \le n-s} \max_{I,J} | P[ \bigcap_{i \in I \cup J} A_{n,i} ]
                   - P[ \bigcap_{i \in I} A_{n,i} ] P[ \bigcap_{i \in J} A_{n,i} ] |,    (10.62)

the second maximum ranging over all $I \subset \{1, \ldots, l\}$ and $J \subset \{l + s, \ldots, n\}$.

How does this help us in the multivariate case? Let $u_n$ be a sequence of
$d$-dimensional thresholds and consider the events $A_{n,i} = \{X_i \le u_n\}$, the ordering of
vectors being component-wise. Clearly, the translated version of the univariate
argument goes through without change. In particular, define $\alpha(n, s)$ as in (10.62)
and say that Condition $D(u_n)$ holds if $\alpha(n, s_n) \to 0$ for some positive integer
sequence $s_n$ such that $s_n = o(n)$. We arrive at the multivariate version of the ELT,
due to Hsing (1989) and Hüsler (1990).

Theorem 10.22 Let $\{X_n\}$ be a stationary sequence for which there exist sequences
of constant vectors $a_n > 0$ and $b_n$, and a distribution function $G$ with non-degenerate
margins such that

    P[a_n^{-1}(M_n - b_n) \le x] \to G(x),    n \to \infty.

If $D(u_n)$ holds with $u_n = a_n x + b_n$ for each $x$ such that $G(x) > 0$, then $G$ is a
$d$-variate extreme value distribution function.


The dependence may affect the limiting distribution $G$ in the sense that it can be
different from the corresponding limit $\tilde G$ for the associated, independent sequence
$\tilde X_n$, $n \ge 1$, of random vectors with the same marginal distribution as $X_1$. So what
is the connection between $G$ and $\tilde G$ and when are they the same?

The latter question is the easier one to answer. Condition $D'(u_n)$ holds if

    \lim_{k \to \infty} \limsup_{n \to \infty} n \sum_{i=2}^{\lfloor n/k \rfloor} P[X_1 \not\le u_n, X_i \not\le u_n] = 0.

Observe that this is the direct translation of Condition $D'(u_n)$ via the $A_{n,i}$. The
arguments in the univariate case go through here as well: the inclusion-exclusion
formula and $D'(u_n)$ give

    P[M_{r_n} \not\le u_n] = r_n \{1 - F(u_n)\} + o(r_n/n)

whenever $r_n = o(n)$, so that

    P[M_n \le u_n] = \{P[M_{r_n} \le u_n]\}^{\lfloor n/r_n \rfloor} + o(1) = \{F(u_n)\}^n + o(1),

provided $n\alpha(n, s_n) = o(r_n)$ for some $s_n = o(r_n)$. We obtain the following result.

Theorem 10.23 Let $G$ be a $d$-variate extreme value distribution and let $a_n > 0$
and $b_n$ be $d$-dimensional vectors such that $D(u_n)$ and $D'(u_n)$ hold for every
$u_n = a_n x + b_n$ with $x \in \mathbb{R}^d$ such that $G(x) > 0$. Then

    P[a_n^{-1}(M_n - b_n) \le x] \to G(x),    n \to \infty,

if and only if

    F^n(a_n x + b_n) \to G(x),    n \to \infty.

10.5.2 The multivariate extremal index 

Recall that under the $D'(u_n)$ condition the asymptotic distribution of $M_n$ is
the same as in the case of an independent sequence. The reason is that
the $D'(u_n)$ condition prevents local clustering of extremes, so that the
temporal dependence becomes negligible at high levels. Things become
different, however, if we allow for local dependence at such high levels as well.
Whereas in the univariate case, the effect of local dependence was
summarized by a single number, the extremal index, the multivariate setting is
more difficult: the analogue of the extremal index turns out to be a function
(Nandagopalan 1994; Perfekt 1997; Smith and Weissman 1996).

Let again $\{X_n\}$ be a stationary sequence of random vectors in $\mathbb{R}^d$ with
distribution function $F$. Assume that there are vectors $a_n > 0$ and $b_n$ and $d$-variate
extreme value distributions $G$ and $\tilde G$ such that

    P[a_n^{-1}(M_n - b_n) \le x] \to G(x),

    F^n(a_n x + b_n) \to \tilde G(x),

as $n \to \infty$. Assume also that the $j$th marginal series $\{X_{n,j}\}_n$ has extremal index
$0 < \theta_j \le 1$, so that the margins of $G$ and $\tilde G$ are related by
$G_j(x) = \{\tilde G_j(x)\}^{\theta_j}$ for $j = 1, \ldots, d$. The $\theta_j$ need not be the same,
showing that the connection between $G$ and $\tilde G$ may be more complicated than in the
univariate case. We will also need the stable tail dependence functions $l$ and
$\tilde l$ of $G$ and $\tilde G$, defined by

    G(x) = \exp[ -l\{ -\log G_1(x_1), \ldots, -\log G_d(x_d) \} ],

    \tilde G(x) = \exp[ -\tilde l\{ -\log \tilde G_1(x_1), \ldots, -\log \tilde G_d(x_d) \} ],

see (8.12).
Definition

To define the multivariate extremal index, it is convenient to abstract from
the margins. For $v \in [0, \infty) \setminus \{0\}$, let $x = x(v)$ be such that
$v_j = -\log \tilde G_j(x_j) = -\theta_j^{-1} \log G_j(x_j)$ for $j = 1, \ldots, d$. In case
$v_j = 0$, we set $x_j = \sup\{x \in \mathbb{R} : G_j(x) < 1\}$. Let $x_n = x_n(v)$ be a sequence
in $\mathbb{R}^d$ such that $x_n \to x$ as $n \to \infty$ and let $u_n = a_n x_n + b_n$. Clearly

    \lim_{n \to \infty} n P[X_{1,j} > u_{n,j}] = v_j,    j = 1, \ldots, d,    (10.63)

together with

    \lim_{n \to \infty} P[M_n \le u_n] = G(x),    \lim_{n \to \infty} F^n(u_n) = \tilde G(x).

Now define the extremal index function, or extremal index in short, of the
sequence $\{X_n\}$ by

    \theta(v) = \log G(x) / \log \tilde G(x),    v \in [0, \infty) \setminus \{0\}.    (10.64)

This is a straightforward extension of the definition in the univariate case
(Theorem 10.4). In terms of the stable tail dependence functions, we have

    \theta(v) = l(\theta_1 v_1, \ldots, \theta_d v_d) / \tilde l(v),    v \in [0, \infty) \setminus \{0\}.    (10.65)

Properties

The multivariate extremal index satisfies a number of properties.

(i) $\theta(v)$ is a continuous function in $v$.

(ii) $\theta(cv) = \theta(v)$ for $0 < c < \infty$ and $v \in [0, \infty) \setminus \{0\}$.

(iii) For $j = 1, \ldots, d$ we have $\theta(e_j) = \theta_j$ where $e_j$ is the $j$th unit vector.

(iv) $0 \le \theta(v) \le 1$.

Properties (i)-(iii) are immediate consequences of (10.65) and properties of
stable tail dependence functions. To prove (iv), observe first that, with $x = x(v)$ and
$u_n = a_n x_n + b_n$ as above,

    P[M_n \le u_n] = 1 - P[M_n \not\le u_n] \ge 1 - n\{1 - F(u_n)\},

so that

    G(x) = \lim_{n \to \infty} P[M_n \le u_n] \ge \lim_{n \to \infty} [1 - n\{1 - F(u_n)\}] = 1 + \log \tilde G(x),

and thus $\exp\{-l(\theta_1 v_1, \ldots, \theta_d v_d)\} \ge 1 - \tilde l(v)$. This inequality and property
(ii) imply

    \theta(v) = \lim_{s \downarrow 0} l(s\theta_1 v_1, \ldots, s\theta_d v_d) / \tilde l(sv)
              \le \lim_{s \downarrow 0} -\log\{1 - \tilde l(sv)\} / \tilde l(sv) = 1,

whence (iv).

Property (iii) can be extended to a univariate characterization of the multivariate
extremal index (Smith and Weissman 1996). Consider the random variables

    Y_n(v) = \max_{j=1,\ldots,d} v_j / \{1 - F_j(X_{n,j})\},    v \in [0, \infty) \setminus \{0\}.    (10.66)

Denoting the quantile function of $F_j$ by $F_j^{\leftarrow}(p) = \inf\{x \in \mathbb{R} : F_j(x) \ge p\}$
($0 < p < 1$), we have, assuming for simplicity that $F_j$ is continuous,

    P[ \max_{i=1,\ldots,n} Y_i(v) \le n ]
        = P[ \max_{i=1,\ldots,n} F_j(X_{i,j}) \le 1 - v_j/n, \forall j = 1, \ldots, d ]
        = P[ M_{n,j} \le F_j^{\leftarrow}(1 - v_j/n), \forall j = 1, \ldots, d ]
        \to G(x),    n \to \infty,

by (10.63). Similarly, $\{P[Y_1(v) \le n]\}^n \to \tilde G(x)$ as $n \to \infty$. Hence

(v) $\theta(v)$ is the (univariate) extremal index of the sequence $\{Y_n(v)\}$.

Finally, we mention that the multivariate extremal index admits similar
interpretations as the univariate one. For instance, under condition $D\{u_n(v)\}$ and for
suitable integers $r_n = o(n)$ we have $\theta(v) = \lim \theta_n^B(v) = \lim \theta_n^R(v)$ where

    1/\theta_n^B(v) = r_n\{1 - F(u_n)\} / P[\exists k = 1, \ldots, r_n : X_k \not\le u_n]
                    = E[ \sum_{k=1}^{r_n} 1(X_k \not\le u_n) \mid \exists k = 1, \ldots, r_n : X_k \not\le u_n ],

    \theta_n^R(v) = P[ \max_{k=2,\ldots,r_n} X_k \le u_n \mid X_1 \not\le u_n ].

The arguments are perfectly analogous to the univariate case and are omitted. In
effect, the multivariate extremal index summarizes temporal dependence at extreme
levels, but the strength of dependence can vary with direction.


Example 10.24 Let $Z_i$, $i \in \mathbb{Z}$, be independent, standard Fréchet random variables.
Also, let $a_{jk}$, $j = 1, \ldots, d$ and $k = 0, 1, 2, \ldots$, be non-negative constants such that
$\sum_{k \ge 0} a_{jk} = 1$ for $j = 1, \ldots, d$. The multivariate moving-maximum process $\{X_n\}$
is defined by

    X_{n,j} = \max_{k \ge 0} a_{jk} Z_{n-k},    j = 1, \ldots, d.

Observe that the margins of $X_n$ are standard Fréchet, and recall from Example 10.5
that the marginal extremal indices are $\theta_j = \max_{k \ge 0} a_{jk}$. Let $F$ be the distribution
function of $X_n$. For $v \in [0, \infty) \setminus \{0\}$, we have

    F^n(n/v_1, \ldots, n/v_d) = \exp( -\sum_{k \ge 0} \max_{j=1,\ldots,d} a_{jk} v_j ).

Similarly to the univariate case, for $v \in [0, \infty) \setminus \{0\}$,

    P[M_{n,j} \le n/v_j, \forall j = 1, \ldots, d]
        = \exp[ -\frac{1}{n} \{ \sum_{l=0}^{n-1} \max_{k=0,\ldots,l} \max_{j=1,\ldots,d} a_{jk} v_j
                               + \sum_{l>0} \max_{k=l,\ldots,l+n-1} \max_{j=1,\ldots,d} a_{jk} v_j \} ]
        \to \exp( -\max_{k \ge 0} \max_{j=1,\ldots,d} a_{jk} v_j ),    n \to \infty.

We conclude that the multivariate extremal index of $\{X_n\}$ is

    \theta(v) = \max_{k \ge 0} \max_{j=1,\ldots,d} a_{jk} v_j / \sum_{k \ge 0} \max_{j=1,\ldots,d} a_{jk} v_j,
                v \in [0, \infty) \setminus \{0\}.
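
For a process with finitely many non-zero coefficients, the extremal index function
of this example is easily evaluated. The following Python sketch is an illustration
under that finiteness assumption; the function name and the coefficient values in
the usage note are hypothetical.

```python
def extremal_index_moving_max(a, v):
    """Extremal index function of a multivariate moving-maximum process.

    a: list of d rows, a[j][k] = a_{jk} >= 0 for k = 0, ..., K - 1, each row
       summing to one (assumption: a_{jk} = 0 for k >= K).
    v: list of d non-negative numbers, not all zero.
    Returns max_k max_j a_{jk} v_j divided by sum_k max_j a_{jk} v_j.
    """
    d, n_lags = len(a), len(a[0])
    b = [max(a[j][k] * v[j] for j in range(d)) for k in range(n_lags)]
    return max(b) / sum(b)
```

With the hypothetical coefficients `a = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]`, evaluating
the function at the unit vectors returns the marginal extremal indices 0.5 and 0.6,
in agreement with property (iii), and rescaling `v` leaves the value unchanged, in
agreement with property (ii).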


Estimation 


How to estimate the multivariate extremal index? Observe that the blocks, runs and
intervals estimators of the univariate extremal index can all be written in terms of
the indicator variables $1(X_k \le u)$. In the multivariate case, then, we can choose
a vector of thresholds, $u$, compute $\hat v$ where $\hat v_j = \sum_{i=1}^{n} 1(X_{i,j} > u_j)$ estimates
$v_j = nP[X_{1,j} > u_j]$, and construct blocks, runs or intervals estimators of $\theta(\hat v)$
from the indicator variables $1(X_i \le u)$, $i = 1, \ldots, n$. A related method would be
to first compute $\hat Y_i(v)$ ($i = 1, \ldots, n$) by plugging in estimates of the unknown $F_j$
into (10.66) and next to estimate the (ordinary) extremal index of this sequence.

Unfortunately, to estimate a function rather than a number is markedly more
difficult: thresholds need to be chosen for every $v$, and the point-wise estimates
$\hat\theta(v)$ need not necessarily satisfy (i)-(iv). To our knowledge, there is no literature
yet on estimation of the multivariate extremal index, except for a manuscript of
Smith and Weissman (1996), in which a less direct method based on the Pickands
dependence function is proposed.
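
The second, plug-in method can be sketched as follows. This Python fragment is
a rough illustration rather than a recommended procedure: the function names are
hypothetical, the margins $F_j$ are replaced by the assumed rank-based estimate
$\mathrm{rank}/(n + 1)$, and the runs estimator is written in one common form in which
an exceedance closes a cluster when the next $r$ observations (or all remaining
ones, near the end of the series) do not exceed the threshold.

```python
def y_sequence(x, v):
    """Plug-in version of (10.66): Y_i(v) = max_j v_j / {1 - F_j(X_ij)}.

    x: list of n observations, each a sequence of length d.
    v: sequence of d non-negative numbers, not all zero.
    Assumption: F_j is estimated by rank / (n + 1) within component j.
    """
    n, d = len(x), len(v)
    y = [0.0] * n
    for j in range(d):
        order = sorted(range(n), key=lambda i: x[i][j])
        for rank, i in enumerate(order, start=1):
            y[i] = max(y[i], v[j] / (1.0 - rank / (n + 1.0)))
    return y


def runs_estimator(y, u, r):
    """Runs estimator of the univariate extremal index at threshold u, run length r."""
    n = len(y)
    exceed = [i for i in range(n) if y[i] > u]
    if not exceed:
        raise ValueError("no exceedances of the threshold")
    ends = sum(
        1 for i in exceed
        if all(y[k] <= u for k in range(i + 1, min(i + r + 1, n)))
    )
    return ends / len(exceed)
```

In view of property (v), a natural threshold for the sequence returned by
`y_sequence(x, v)` is the sample size, that is, `runs_estimator(y, len(y), r)`; the run
length `r` remains a tuning parameter.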

10.5.3 Further reading 

The multivariate extremal index was proposed in Nandagopalan (1994). The same 
paper also discusses multivariate extensions of some point-process results in the 
spirit of section 10.3. 
Smith and Weissman (1996) and Zhang (2002) introduced a class of processes
called multivariate maxima of moving maxima, or M4 in short. These processes
constitute a generalization of the multivariate moving-maximum processes
of Example 10.24. The multivariate extremal indices of M4 processes turn out to
form a rich subclass of those of general multivariate stationary processes. In this
sense, the problem of modelling extremes of multivariate stationary processes can
be stylized to the study of extremes of M4 processes.

Extremes of multivariate Markov chains are treated in Perfekt (1997). The mul¬ 
tivariate extremal index is studied first for general multivariate stationary processes 
and next for multivariate Markov chains, with special attention to a multivariate 
version of the tail chain. 

A few declustering schemes have been proposed for multivariate sequences 
(Coles and Tawn 1991; Nadarajah 2001). These schemes are designed to extract 
independent observations from a multivariate, stationary sequence: clusters are 
identified and then summarized by a single value, such as the component-wise 
maximum of the observations in the cluster. The approach of Coles and Tawn 
(1991) is a multivariate version of blocks declustering; that of Nadarajah (2001) is 
a complicated extension of runs declustering. Both methods require the choice of 
one or more declustering parameters. The intervals declustering scheme (Ferro and 
Segers 2003) can be applied without arbitrary choice of declustering parameters 
by considering the return times to a ‘failure set’, membership of which defines
an observation as extreme. Such a general formulation, already alluded to by 
Nandagopalan (1994), is developed in Segers (2002). 


10.6 Additional Topics 


Heavy-tailed time series 


Efforts to model financial time series have led to the development of various
time-series models, extending the classical framework of linear processes (Brockwell
and Davis 1991)

    X_t = \sum_i \psi_i Z_{t-i},    t \in \mathbb{Z},

in particular, of auto-regressive moving-average (ARMA) processes; here the
innovations $Z_t$ are independent, identically distributed with finite second moment, while
the parameters $\psi_i$ satisfy a certain summability constraint. Deficiencies of these
ARMA processes are that they do not satisfactorily model the more extreme
observations of financial time series with respect to both the magnitude and the serial
dependence of such extremes. For a financial risk manager, such shortcomings are
particularly grave because the financial risk involved in holding a certain portfolio
may be underestimated.
A natural extension of the classical framework is to allow the innovations $Z_t$
to be heavy-tailed, leading to heavy-tailed linear time series. Extremal
characteristics of such processes, like the extreme value index, the extremal index, and the
limiting distribution of clusters of extremes, can be expressed in terms of the tail
of the innovation distribution and the parameters $\psi_i$. Moreover, for ARMA($p$, $q$)
processes

    X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \sum_{j=1}^{q} \theta_j Z_{t-j} + Z_t,

with innovation distribution in the domain of attraction of a stable distribution, it
is known how to estimate the coefficients $\phi_i$ and $\theta_j$ (Mikosch et al. 1995). This
allows reconstruction of the innovations, leading, after estimation of the innovation
distribution, to estimates of characteristics of clusters of extremes. A
recommendable overview with numerous references of extreme value theory for heavy-tailed
linear time series is Chapter 7 of Embrechts et al. (1997).

Particularly popular in finance are the auto-regressive conditionally
heteroscedastic (ARCH) process (Engle 1982) and its numerous ramifications, in
particular, generalized ARCH or GARCH (Bollerslev 1986). Not surprisingly,
their extremal properties have been thoroughly investigated (Basrak et al. 2002;
Borkovec 2000; Borkovec and Klüppelberg 2003; de Haan et al. 1989; Mikosch
and Stărică 2000), even for multivariate versions (Stărică 1999).

Finally, replacing sums by maxima in the definition of linear time series and
requiring the innovation distribution to be Fréchet leads to max-stable processes,
in particular, max-ARMA processes, of which the ARMAX and moving-maximum
processes considered in this chapter are special cases. The probability theory for
such processes is well developed (Alpuim 1989; Alpuim et al. 1995; Davis and
Resnick 1989, 1993; Deheuvels 1983; de Haan 1984; de Haan and Pickands 1986),
although statistical applications have appeared only recently (Hall et al. 2002;
Zhang 2002; Zhang and Smith 2001).

Tail estimation for the marginal distribution 

How to estimate the tail of the marginal distribution of a random sample was the 
topic of Chapters 4 and 5. Unfortunately, the assumption of independence is all 
too often not very reasonable: hot summer days group together in heat waves, 
and large positive or negative returns of financial assets occur in periods of high 
volatility. Two questions arise: Are these estimation procedures still applicable? 
And what is the effect of dependence on estimation uncertainty? 

The answer to the first question is affirmative: all familiar tail estimators, be it the 
Hill estimator (Hill 1975) or the maximum likelihood estimator in the POT model 
(Smith 1987) or indeed any other estimator, are consistent and even asymptotically 
normal provided the dependence between observations that are far apart in time is 
small. The second question, unfortunately, is more difficult to answer. Still, we can 
assert that typically, the effect of dependence is to increase the asymptotic variances
of tail estimators, although it is not easy to say by how much. In particular, confidence
intervals based on theory for independent variables risk being too narrow.

Broadly speaking, two strategies are conceivable: (1) Proceed with estimation 
as if the data were independent, but adapt the standard errors; (2) Extract from the 
original sample a new, approximately independent series, on which the inference 
procedures can then be applied as usual. The simplest example of the second 
strategy is the method of annual maxima, in which data are grouped in blocks 
and a GEV distribution is fitted to the block maxima. Recall from Section 10.2 
that under $D(u_n)$-type conditions such block maxima are indeed approximately 
independent. Alternatively, in the POT method we fit a GP distribution not to 
all excesses over a high threshold but only to the cluster maxima, a procedure 
motivated by the point process results of Section 10.3. 
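
As a rough illustration of the second strategy, the following Python sketch extracts block maxima and cluster maxima from a series. The runs rule used here to delimit clusters (a cluster is closed once `run_length` consecutive values fall at or below the threshold) is an assumption made for illustration, as are the function names and the toy series; none of them is taken from the text.

```python
def block_maxima(series, block_size):
    """Maximum of each complete block of block_size consecutive values."""
    last_start = len(series) - block_size
    return [max(series[i:i + block_size]) for i in range(0, last_start + 1, block_size)]


def cluster_maxima(series, threshold, run_length):
    """Maxima of clusters of exceedances, clusters being delimited by a runs rule."""
    maxima = []
    current_max = None   # largest exceedance in the cluster being built
    gap = 0              # consecutive non-exceedances since the last exceedance
    for x in series:
        if x > threshold:
            current_max = x if current_max is None else max(current_max, x)
            gap = 0
        elif current_max is not None:
            gap += 1
            if gap >= run_length:      # the cluster is considered finished
                maxima.append(current_max)
                current_max = None
                gap = 0
    if current_max is not None:        # close a cluster still open at the end
        maxima.append(current_max)
    return maxima


toy = [1, 5, 6, 1, 7, 1, 1, 1, 8, 1]   # toy data, not from the book
print(block_maxima(toy, 5))
print(cluster_maxima(toy, threshold=4, run_length=2))
```

The resulting series would then be passed to a GEV or GP fitting routine as if the values were independent.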

Which of the two strategies is the better one depends on the model assumptions 
one is willing to make, perhaps motivated by the problem at hand. In general, 
the more information one has about the model, the easier it becomes to extract 
approximately independent residuals, and the more successful will the second 
method become. For instance, Resnick and Starica (1997) considered an autoregressive 
model 

\[
X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + Z_t, \qquad t \in \mathbb{Z},
\]

with independent, identically distributed innovations $Z_t$ with positive extreme value 
index $\gamma$. They showed that estimating $\gamma$ with the Hill estimator on the sample 
$X_1, \ldots, X_n$ is inferior to first estimating the coefficients $\phi_i$ (for instance, as in 
Mikosch et al. (1995)) and second, applying the Hill estimator to the estimated 
residuals $\hat{Z}_t = X_t - \sum_{i=1}^{p} \hat{\phi}_i X_{t-i}$, the latter procedure attaining the efficiency of 
the case of independent data. Similarly, when studying extremes of a financial 
return series, McNeil and Frey (2000) propose to fit a GARCH model to the series 
and apply standard tail estimators to the estimated innovation sequence. 

However, if there is no clear indication as to which model to use, basically the 
only approximately independent series to be extracted are, as mentioned already, 
block maxima or peaks over high thresholds. In both cases, potentially useful information 
is thrown away, rendering these methods less attractive. A more promising 
road then is to apply an appropriate estimator directly to the data and estimate 
its asymptotic variance. This presupposes that the asymptotic distribution of the 
estimator is known for dependent data as well. 

Not surprisingly, the first tail estimator for which this program was carried 
out is the classical Hill estimator. Hsing (1991) proved asymptotic normality of 
the Hill estimator for stationary sequences satisfying certain mixing conditions 
and gave explicit estimators for its asymptotic variance. Resnick and Starica 
(1995, 1998) also gave general consistency results, with specializations to various specific 
models such as infinite order moving averages, bilinear processes, solutions 
of stochastic difference equations, and hidden semi-Markov models. Related to the 
Hill estimator is the ratio estimator (Goldie and Smith 1987), which was investigated 
in the setting of dependent variables by Novak (1999). 

Unfortunately, all these methods are somewhat ad hoc in the sense that it is 
not clear how to generalize them to other estimators like, for instance, the popular 
maximum-likelihood estimator for the GP distribution fitted to excesses over a 
high threshold. A real breakthrough was achieved by Drees (2000, 2002, 2003). 
He established powerful convergence results for tail empirical quantile processes 
for certain stationary time series. Since most tail estimators can be written as 
smooth functionals of such processes, the classical delta-method immediately leads 
to asymptotic normality for a wide variety of estimators of the extreme value 
index and high quantiles. Moreover, the resulting expressions for the asymptotic 
variance lend themselves to data-driven methods for the construction of confidence 
intervals, the actual coverage probability of which improves considerably upon that 
of intervals constructed under the (false) assumption of independence. 

Still, these methods deal only with the problem of estimating the marginal 
tail. But often, it is also the aggregate effect of extreme observations occurring 
one after the other that is of interest: although a single day with a large amount 
of rainfall may not cause much trouble, the succession of several such days definitely 
will. Therefore, we need to estimate appropriate summaries of the strength of 
temporal dependence as well. To assess the uncertainty on estimates of these summaries 
together with the marginal tail, we have in this chapter relied on bootstrap 
techniques motivated by point-process theory. 

Non-stationary processes 

In this chapter, we have relaxed the assumption of independent, identically distributed 
random variables to that of a stationary sequence. In practice, however, data 
are seldom stationary: meteorological data typically have a strong seasonal component, 
tick-by-tick financial data exhibit a clear daily pattern, while macro-economic 
data often show an upward or downward trend. For the Uccle temperature data, 
our solution, which was, by the way, only partially successful, was to extract from 
the whole series the July data. In other applications, however, the non-stationarity 
itself of extremes may be of interest. This was treated in Chapter 7 in case there 
is no serial dependence. 

Exceedances of a non-stationary sequence $X_1, X_2, \ldots$ above a boundary function 
$u_{n,1}, u_{n,2}, \ldots$ define a point process, 

\[
N_n(\cdot) = \sum_{i \in \mathcal{I}} \mathbf{1}(i/n \in \cdot), \qquad \mathcal{I} = \{\, i : X_i > u_{n,i},\ 1 \le i \le n \,\}.
\]

As in the stationary case (Section 10.3), $N_n$ converges, under mild mixing conditions 
and assumptions on the marginal distributions, to a certain compound Poisson 
process (Hüsler 1993; Hüsler and Schmidt 1996). This result hints at the possibility 
of extending regression analysis for extremes to allow for serial dependence and 
clustering. 
BAYESIAN METHODOLOGY IN EXTREME VALUE STATISTICS 

co-authored by Daan de Waal 

11.1 Introduction 

The Bayesian paradigm provides a set of interesting additional statistical tools when 
carrying out an extreme value analysis. There are several good reasons for that. 


• Given the low amount of information often available in extreme value analysis, 
it is natural to consider other sources of knowledge; these can occur in 
the form of known constraints, whether of physical, economic or other 
origin. For instance, an economist may want to specify a maximum value for 
a quantity or variable under study. There are, however, several other possible 
ways in which an expert with knowledge of the processes behind the data 
may deliver information that is relevant to extremal behaviour and which is 
independent of the available data. 


• Prediction is also naturally incorporated in a Bayesian setting. The concept 
of posterior prediction matches with the fact that the principal inferential 
objective of an extreme value analysis is of predictive nature. 

• Bayesian analysis is not dependent on regularity assumptions required by, for 
instance, the maximum likelihood and probability weighted moments methods. 
As with the Pickands type estimators, the moment estimator and others discussed 
in Chapter 5, Bayesian inference provides a viable alternative in cases 
when maximum likelihood and probability weighted moments break down. 

On the other hand, many statisticians argue that the problem of prior elicitation leads 
to subjectiveness. Without taking part in the discussion for and against Bayesian 
methodology, we aim at showing that a practical statistical analysis can indeed 
gain from this approach. Some important contributions to this subject are found in 
Pickands (1994), Coles and Powell (1996), Coles and Tawn (1996b), Smith (2000), 
Smith and Goodman (2000) and Coles (2001) (Chapter 9). Bayesian inference for 
extremes has been taken up only quite recently, thanks to the availability of 
Markov Chain Monte Carlo (MCMC) techniques. These computer-intensive methods 
have opened up the field of extremes to complicated settings involving large 
parameter sets. So the methods described in this chapter appear to be alternatives 
that are of full value or even preferable to more conventional ones. 

We will briefly review some of the basic characteristics of a Bayesian analysis 
here. Then we go over the statistical problems raised in Part I (see Chapters 4 and 
5), to end with a more complex application from environmetrics. 


11.2 The Bayes Approach 


Let $\mathbf{y} = (y_1, \ldots, y_m)$ denote the observed data of a random variable $Y$ distributed 
according to a distribution with density function $f(y \mid \theta)$. For instance, $\mathbf{y}$ can represent 
a random sample of $m$ independent observations consisting of maxima. $\theta$ 
denotes the vector of parameters. Let $\pi(\theta)$ denote the density of the prior distribution 
for $\theta$. We write the likelihood for $\theta$ as $f(\mathbf{y} \mid \theta)$, which equals 

\[
f(\mathbf{y} \mid \theta) = \prod_{i=1}^{m} f(y_i \mid \theta)
\]

in case of independence. According to Bayes’ theorem, 

\[
\pi(\theta \mid \mathbf{y}) = \frac{f(\mathbf{y} \mid \theta)\,\pi(\theta)}{\int_{\Omega} f(\mathbf{y} \mid \theta)\,\pi(\theta)\,d\theta} \propto f(\mathbf{y} \mid \theta)\,\pi(\theta), \tag{11.1}
\]

where the integral is taken over the parameter space $\Omega$. This well-known probabilistic 
result provides a framework for statisticians to convert an initial set of 
beliefs about $\theta$, represented by the prior $\pi(\theta)$, into a posterior distribution $\pi(\theta \mid \mathbf{y})$ 
of $\theta$ that is proportional to the product of the likelihood and the prior. Estimates of 
$\theta$ will then be obtained through the mode or mean of the posterior, while the accuracy 
of an inference is described by the posterior distribution itself, for instance, 
through a highest posterior density (hpd) region according to a certain probability 
$1 - \alpha$, which is the region of values that contains $100(1 - \alpha)\%$ of the posterior 
probability and also has the characteristic that the density within the region is never 
lower than that outside. Here, there is no need to fall back on asymptotic theory. 

Ease of prediction is another attractive characteristic of the Bayesian approach. 
If $Y_{m+1}$ denotes a future observation with density function $f(y_{m+1} \mid \theta)$, then the 
posterior predictive density of $Y_{m+1}$ given $\mathbf{y}$ is given by 

\[
f(y_{m+1} \mid \mathbf{y}) = \int_{\Omega} f(y_{m+1} \mid \theta)\,\pi(\theta \mid \mathbf{y})\,d\theta. \tag{11.2}
\]





Compared to other approaches to prediction, the predictive density has the advantage 
that it reflects uncertainty in the model through $\pi(\theta \mid \mathbf{y})$ and uncertainty due to 
variability in future observations through $f(y_{m+1} \mid \theta)$. The posterior predictive probability 
of $Y_{m+1}$ exceeding some high threshold $y$ is accordingly given by 

\[
P(Y_{m+1} > y \mid \mathbf{y}) = \int_{\Omega} P(Y_{m+1} > y \mid \theta)\,\pi(\theta \mid \mathbf{y})\,d\theta. \tag{11.3}
\]

The posterior predictive distribution (11.3) most of the time is difficult to obtain 
analytically. However, it can be approximated if the posterior distribution has been 
estimated by simulation as discussed further on. Given a sample $\theta_1, \ldots, \theta_r$ from 
$\pi(\theta \mid \mathbf{y})$, we can use the approximation 

\[
P(Y_{m+1} > y \mid \mathbf{y}) \approx \frac{1}{r} \sum_{i=1}^{r} P(Y_{m+1} > y \mid \theta_i), \tag{11.4}
\]

where $P(Y_{m+1} > y \mid \theta_i)$ follows immediately from the postulated density function 
$f(y \mid \theta)$. A posterior predictive $(1 - p)$ quantile is obtained by solving 

\[
P(Y_{m+1} > y \mid \mathbf{y}) = p. \tag{11.5}
\]

Most often, this solution cannot be found analytically, and then the solution $y$ of 
(11.5) can be found using a standard numerical solver. 
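
The approximation (11.4) and the numerical solution of (11.5) can be sketched in a few lines of Python. The sketch assumes a GEV likelihood, uses plain bisection as the "standard numerical solver", and works on three placeholder parameter draws that merely stand in for the output of an MCMC run; the function names and the draws are illustrative and not taken from the text.

```python
import math


def gev_survival(y, mu, sigma, gamma):
    """P(Y > y | theta) under a GEV model with theta = (mu, sigma, gamma)."""
    z = (y - mu) / sigma
    if abs(gamma) < 1e-12:                  # Gumbel limit gamma -> 0
        return -math.expm1(-math.exp(-z))
    s = 1.0 + gamma * z
    if s <= 0.0:                            # y lies outside the support
        return 1.0 if gamma > 0 else 0.0
    return -math.expm1(-s ** (-1.0 / gamma))


def predictive_exceedance(y, draws):
    """Monte Carlo average (11.4) over posterior draws theta_1, ..., theta_r."""
    return sum(gev_survival(y, *theta) for theta in draws) / len(draws)


def predictive_quantile(p, draws, lo, hi, tol=1e-8):
    """Solve (11.5) by bisection; the exceedance probability decreases in y."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if predictive_exceedance(mid, draws) > p:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Placeholder draws (mu, sigma, gamma); in practice they come from an MCMC run.
draws = [(1200.0, 450.0, -0.05), (1260.0, 470.0, -0.08), (1300.0, 500.0, 0.02)]
q99 = predictive_quantile(0.01, draws, lo=0.0, hi=1.0e5)
print(q99, predictive_exceedance(q99, draws))
```

The bracket `[lo, hi]` has to be chosen so that the exceedance probability is above p at `lo` and below p at `hi`.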


11.3 Prior Elicitation 

The main objection against the use of Bayesian analysis is the need for specifying 
a prior $\pi(\theta)$. When available information is minimal, one can start an 
updating scheme with an objective prior distribution. Uniform priors are the simplest 
examples of this kind. Other proposals, for instance, are Jeffreys’ prior and 
the maximal data information (MDI) prior. Objective priors have the advantage 
that they can serve as a benchmark that does not reflect the particular biases of the 
analyst, and that their use yields statistical procedures that are analogous to those 
developed using classical (frequentist) procedures. In multiple parameter situations, the 
parameters should not be taken to be independent, which is sometimes the case 
with objective priors. Another point of concern is the invariance under certain 
groups of transformations and different parametrizations. 


Jeffreys’ prior 

Jeffreys’ prior (Jeffreys (1961)) is defined as $J(\theta) \propto \sqrt{|I(\theta)|}$, where $I(\theta)$ is 
Fisher’s information matrix with $(i, j)$-th element 

\[
I_{ij}(\theta) = E\left[ -\frac{\partial^2 \log f(Y \mid \theta)}{\partial\theta_i\,\partial\theta_j} \right], \qquad i, j = 1, \ldots, p,
\]

where $p$ denotes the dimension of $\theta$. Jeffreys’ prior is considered to be the standard 
starting rule for an objective Bayesian analysis. It is invariant under one-to-one 
transformations and takes the dependence between the parameters into account. 
When applied to the models appearing in extreme value methodology, Jeffreys’ 
prior leads to the same restrictions on the parameter set as with the maximum 
likelihood approach. See, for instance, Bernardo and Smith (1994), Chapter 4, for 
more details. 

MDI prior 

Zellner (1971) defined the MDI prior to provide maximal average data information 
on $\theta$. These priors are not invariant under reparametrization, but are usually 
easy to implement. The MDI prior for $\theta$ is defined as $\pi(\theta) \propto 
\exp E\{\log f(Y \mid \theta)\}$. Constraints on the parameters can be built into the prior. We 
refer to Zellner (1971) for more details. 

On the other hand, subjective prior distributions represent an attempt to bring 
prior knowledge about the phenomenon under study into the problem. This always 
leads to proper priors, which means that they integrate to 1, and these priors are 
typically well behaved analytically. However, it is not always easy to translate 
the prior knowledge into a meaningful probability distribution. Also, the results of 
a Bayesian analysis that used a subjective prior are meaningful to the particular 
analyst whose prior distribution was used, but not necessarily to other researchers. 
Families of subjective prior distributions include the natural conjugate families, exponential 
power distributions and mixture prior distributions. Natural conjugates are the most 
popular, possibly due to their mathematical convenience: they form a class of distributions 
that is closed under transformation from prior to posterior; that is, the implied 
posterior distribution with respect to a natural conjugate prior is in the same family 
as the prior distribution. 

Specifically in an extreme value context, authors have rather systematically 
advocated the specification of priors in terms of extreme quantiles of the underlying 
process rather than the extreme value model parameters themselves; see, for 
instance, Coles and Tawn (1996b). Of course, subject to self-consistency, a prior 
distribution on a set of two or three parameters can always be transformed to a 
prior distribution on the original model parameters themselves. An example in 
insurance comes from the fact that finite right end-points are sometimes specified 
for loss distributions, while the claim data appear to be of Pareto-type on the basis 
of the data analytic methods described in the first part of this book. Alternatively, 
in some specific contexts, a prior can be designed so that the analysis can meet 
requirements set by experts. A well-known example of this kind is the requirement 
in reinsurance applications that the EVI $\gamma$ should not be larger than 1, or even 
0.5, for the most common premium calculation methods to be valid. We will give 
some examples of this kind using conjugate priors, but mostly we will restrict 
ourselves to the use of objective priors. In this way, we hope to convince more 
people of the possible added value of a Bayesian approach to an extreme value 
analysis. Of course, the question of sensitivity of tail estimations to changes in 
prior specification should be posed. 


11.4 Bayesian Computation 

The main obstacle to the widespread use of Bayesian techniques for a long time 
was the difficulty of computation of the different integrals involved with posterior 
inference. Appropriate choice of prior families for certain models can avoid the 
necessity to calculate, for instance, the normalizing integral in (11.1) but this is 
rather exceptional, certainly in multi-parameter settings. This difficulty has been 
lifted with the development of simulation-based techniques. MCMC methods have 
popularized the use of Bayesian techniques to a great extent. We discuss here briefly 
two of the more popular MCMC methods: the Gibbs sampler and the Metropolis- 
Hastings algorithm. More details can, for instance, be found in Chapter 5 of Carlin 
and Louis (2000). 


The Metropolis-Hastings algorithm 

The basic idea of the Metropolis-Hastings algorithm is particularly simple. One 
simulates a sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ in the following way: starting with an initial point 
$\theta^{(1)}$, the next state $\theta^{(i+1)}$ is chosen by first sampling a candidate point $\theta^*$ from 
a proposal density $q(\theta^* \mid \theta^{(i)})$ that depends on the current state $\theta^{(i)}$. An example of a 
proposal density is the multivariate normal distribution with mean $\theta^{(i)}$ and 
a suitably chosen covariance matrix. The candidate $\theta^*$ is accepted with probability 
$\alpha_i$ where 

\[
\alpha_i = \min\left\{ \frac{\pi(\theta^* \mid \mathbf{y})\, q(\theta^{(i)} \mid \theta^*)}{\pi(\theta^{(i)} \mid \mathbf{y})\, q(\theta^* \mid \theta^{(i)})},\ 1 \right\}. \tag{11.6}
\]

If the candidate is accepted, the next state becomes $\theta^{(i+1)} = \theta^*$, otherwise the 
chain remains at $\theta^{(i+1)} = \theta^{(i)}$. Both rejection and acceptance count as an iteration 
of the algorithm. When a candidate is sampled for which the posterior (11.1) 
is 0, we must continue sampling until we have a candidate with $f(\mathbf{y} \mid \theta^*) > 0$. 
Remarkably, under some regularity conditions, the stationary distribution is exactly 
the posterior distribution $\pi(\theta \mid \mathbf{y})$, called the target distribution of the Markov chain. 
Although the proposal density can be arbitrarily chosen, the convergence largely 
depends on the proposal density. On the one hand, a proposal density with large 
jumps to places far from the support of the posterior has low acceptance rate and 
causes the Markov chain to stand still most of the time. On the other hand, a 
proposal density with small jumps and high acceptance rate may cause the chain 
to move slowly and to get stuck in one state. A great advantage, though, of this 
algorithm is that it only depends on the posterior density through ratios of the form 
$\pi(\theta^* \mid \mathbf{y})/\pi(\theta^{(i)} \mid \mathbf{y})$. Hence, the posterior density only needs to be known up to a 
proportionality constant. 
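
A minimal Python sketch of the algorithm for a one-dimensional parameter follows. It assumes a Gaussian random-walk proposal, which is symmetric so that the ratio of proposal densities in (11.6) cancels, and it simply rejects candidates of zero posterior density, a common variant that differs slightly from the resampling described above. The target used to exercise it, the unnormalised density $\gamma^{-k-1} e^{-kH/\gamma}$ with k = 95 and H = 0.27109, is chosen for illustration only, as are the function names.

```python
import math
import random


def metropolis_hastings(log_post, theta0, step, n_iter, rng):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_post returns the log of the unnormalised posterior density, or
    -inf where that density is zero. theta0 must have positive density.
    """
    chain = [theta0]
    current, current_lp = theta0, log_post(theta0)
    accepted = 0
    for _ in range(n_iter):
        candidate = current + step * rng.gauss(0.0, 1.0)
        cand_lp = log_post(candidate)
        # Accept with probability min(1, pi(candidate | y) / pi(current | y)).
        if rng.random() < math.exp(min(0.0, cand_lp - current_lp)):
            current, current_lp = candidate, cand_lp
            accepted += 1
        chain.append(current)      # a rejection repeats the current state
    return chain, accepted / n_iter


K, H = 95, 0.27109                 # illustrative values


def log_post(g):
    """Log of the unnormalised inverse gamma density g^(-K-1) exp(-K*H/g)."""
    if g <= 0.0:
        return -math.inf
    return -(K + 1) * math.log(g) - K * H / g


chain, rate = metropolis_hastings(log_post, 0.3, 0.05, 20000, random.Random(1))
kept = chain[2000:]                # discard an initial burn-in stretch
print(sum(kept) / len(kept), rate)
```

Because only differences of `log_post` enter the acceptance step, the normalising constant is never needed. For this target the chain average can be compared with the exact mean KH/(K - 1) of the inverse gamma distribution.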
The Gibbs sampler 

The Gibbs sampler, introduced by Geman and Geman (1984), is an alternating conditional 
MCMC algorithm. It can be regarded as a special case of the Metropolis-Hastings 
algorithm as follows. Consider a parametric vector $\theta = (\theta_1, \ldots, \theta_p)$. 
One iteration of the Gibbs sampler consists of cycling through the $p$ coordinates 
$\theta_1, \ldots, \theta_p$ by drawing a sample of one coordinate conditional on the values of all 
the others. For every iteration step, say $i$, there are $p$ steps. With a predetermined 
ordering of the $p$ coordinates, each $\theta_j$ is sampled from the conditional distribution 
given all the other components of $\theta$. So given a set of values $\{\theta_1^{(i)}, \ldots, \theta_p^{(i)}\}$, the 
algorithm proceeds as follows: 

\[
\begin{aligned}
&\text{Draw } \theta_1^{(i+1)} \sim \pi(\theta_1 \mid \theta_2^{(i)}, \ldots, \theta_p^{(i)}, \mathbf{y}), \\
&\text{Draw } \theta_2^{(i+1)} \sim \pi(\theta_2 \mid \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_p^{(i)}, \mathbf{y}), \\
&\qquad \vdots \\
&\text{Draw } \theta_p^{(i+1)} \sim \pi(\theta_p \mid \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}, \mathbf{y}).
\end{aligned}
\]

It can be proven that the corresponding acceptance probability is equal to unity, 
so that every jump is accepted for Gibbs sampling. We also mention 
the possibility of Gibbs sampling combined with some Metropolis steps; see, for 
instance, Chapter 11 in Gelman et al. (1995). 
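
To make the cycling through full conditionals concrete, here is a small Python sketch for a target that is not from the text: a standard bivariate normal distribution with correlation rho, for which both full conditionals are known normal distributions. The function names are illustrative.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2),
    and symmetrically for theta_2 | theta_1.
    """
    sd = math.sqrt(1.0 - rho * rho)
    t1, t2 = start
    chain = []
    for _ in range(n_iter):
        t1 = rng.gauss(rho * t2, sd)   # draw theta_1 given the current theta_2
        t2 = rng.gauss(rho * t1, sd)   # draw theta_2 given the updated theta_1
        chain.append((t1, t2))
    return chain


def sample_correlation(pairs):
    """Empirical correlation coefficient of a list of pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    return s12 / math.sqrt(s11 * s22)


gibbs_chain = gibbs_bivariate_normal(rho=0.8, n_iter=20000, rng=random.Random(0))
print(sample_correlation(gibbs_chain[1000:]))
```

No accept or reject step appears, in line with the acceptance probability being equal to unity.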

11.5 Univariate Inference 

In this section, we revisit the most important models considered in Part I, namely, 
the fit of the GEV based on block maxima, followed by the different methods 
considering peaks over threshold data. At the end of the chapter, we also consider 
some extensions of these basic models that can be used to provide good global fits 
in addition to appropriate tail fits. 

11.5.1 Inference based on block maxima 

To illustrate the use of the Bayesian methodology described above when block 
maxima are available, we consider again the annual maximal discharges of the 
Meuse river in Belgium, which were already considered in sections 2.2 and 5.1. 
The likelihood model is 

\[
Y_i \mid \sigma, \gamma, \mu \sim \mathrm{GEV}(\sigma, \gamma, \mu), \qquad i = 1, \ldots, 85,
\]

where $Y_i$ denotes the maximum for the year indexed by $i$. Here, $\theta = (\sigma, \gamma, \mu)$ and 

\[
f(y \mid \theta) = \frac{1}{\sigma} \left( 1 + \gamma\,\frac{y - \mu}{\sigma} \right)^{-1/\gamma - 1} \exp\left( -\left( 1 + \gamma\,\frac{y - \mu}{\sigma} \right)^{-1/\gamma} \right).
\]

As a prior distribution, we choose here the MDI prior 

\[
\pi(\theta) \propto \exp E\{\log f(Y \mid \theta)\} \propto \frac{1}{\sigma}\, e^{-\psi(1)(1 + \gamma)},
\]

where $\psi(1)$ denotes Euler’s constant (approximately 0.5772). Jeffreys’ prior in case of the GEV is quite 
complicated and only exists when $\gamma > -0.5$. 

We use the Metropolis-Hastings algorithm with the proposal density $q$ taken as 
a multivariate normal on $(\log\sigma, \mu, \gamma)$ with independent components and respective 
standard deviations $(0.01, 10, 0.01)$: 

\[
\begin{aligned}
\log\sigma^* &= \log\sigma^{(i)} + 0.01\,\epsilon_1, \\
\mu^* &= \mu^{(i)} + 10\,\epsilon_2, \\
\gamma^* &= \gamma^{(i)} + 0.01\,\epsilon_3,
\end{aligned}
\]

where $(\epsilon_1, \epsilon_2, \epsilon_3)$ denote independent standard normally distributed random variables. 
The values of the variances in the specification of $q$ were chosen after a 
little trial and error in order to make the algorithm work more efficiently. Initializing 
with $(\sigma, \mu, \gamma) = (500, 1200, 0.01)$, the values generated by 15,000 iterations 
of the chain are plotted in Figure 11.1. The convergence can be sped up if 
instead $q$ is taken to be a trivariate normal distribution with the covariance matrix 
determined by the information matrix from the log-posterior. In Figure 11.2, we 
show the estimated posterior densities of the GEV parameters (in the original scale 
for $\sigma$) and the 100-year return level $q_{Y,0.01}$. The estimated posterior density of the 
100-year return level is obtained from the GEV quantile function 

\[
q_{Y,p} = \mu + \frac{\sigma}{\gamma} \left[ (-\log(1 - p))^{-\gamma} - 1 \right],
\]

replacing $\sigma$, $\gamma$ and $\mu$ by their respective posterior realizations. The posterior mean 
estimates together with the 95% hpd regions are given by 

\[
\hat\mu = 1264\ (1156, 1385), \quad \hat\sigma = 471\ (400, 547), \quad \hat\gamma = -0.075\ (-0.200, 0.072), \quad \hat q_{Y,0.01} = 3109\ (2711, 3809).
\]

In Figure 11.3, we show the estimated posterior predictive distribution of a future 
observation given $\mathbf{y}$ and the corresponding posterior predictive 0.99 quantile. 
These estimates are obtained via (11.4) and (11.5) respectively. 
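
The ingredients of this analysis, the GEV log-density, the MDI prior and the return level, can be written down compactly. The Python sketch below is one possible arrangement under stated assumptions: the function names and the four-point toy data set are hypothetical, the Meuse data are not reproduced, and the log-posterior is expressed in $(\sigma, \gamma, \mu)$ without any change-of-variable term for a proposal on $\log\sigma$.

```python
import math

EULER = 0.5772156649015329   # Euler's constant, written psi(1) in the text


def gev_log_density(y, sigma, gamma, mu):
    """Log of the GEV density f(y | theta) with theta = (sigma, gamma, mu)."""
    z = (y - mu) / sigma
    if abs(gamma) < 1e-12:                      # Gumbel limit gamma -> 0
        return -math.log(sigma) - z - math.exp(-z)
    s = 1.0 + gamma * z
    if s <= 0.0:                                # y outside the support
        return -math.inf
    return -math.log(sigma) - (1.0 / gamma + 1.0) * math.log(s) - s ** (-1.0 / gamma)


def log_posterior(sigma, gamma, mu, data):
    """Unnormalised log-posterior: MDI log-prior plus GEV log-likelihood."""
    if sigma <= 0.0:
        return -math.inf
    log_prior = -math.log(sigma) - EULER * (1.0 + gamma)
    return log_prior + sum(gev_log_density(y, sigma, gamma, mu) for y in data)


def return_level(p, sigma, gamma, mu):
    """GEV quantile q_{Y,p} exceeded with probability p in a given year."""
    x = -math.log1p(-p)                         # -log(1 - p)
    if abs(gamma) < 1e-12:
        return mu - sigma * math.log(x)
    return mu + sigma * math.expm1(-gamma * math.log(x)) / gamma


toy_maxima = [1100.0, 1500.0, 900.0, 2100.0]    # toy data, not the Meuse series
print(log_posterior(471.0, -0.075, 1264.0, toy_maxima))
print(return_level(0.01, 471.0, -0.075, 1264.0))
```

Evaluating `return_level` at the posterior means quoted above gives a plug-in value that need not coincide with the posterior mean 3109 of the return level, since the latter averages the quantile over the posterior draws.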

11.5.2 Inference for Fréchet-Pareto-type models 

As in the first part of the book, we again consider the estimation of extreme events 
within the Fréchet-Pareto-type framework, that is, $1 - F(x) = x^{-1/\gamma}\,\ell_F(x)$ with 
$\ell_F$ some slowly varying function at infinity. Recall that here there are mainly two 
approaches possible. 

[Figure 11.1: values generated by the chain, plotted against iteration.] 

One approach is to consider the relative excesses $Y_j = X_i/t$ ($X_i > t$) for some appropriate 
threshold $t$ and fit a strict Pareto distribution with distribution function 
$1 - y^{-1/\gamma}$ ($y > 1$). As with Hill’s estimator for $\gamma$, we choose $t = X_{n-k,n}$ so that 
the ordered excesses are given by $Y_j = X_{n-j+1,n}/X_{n-k,n}$, $j = 1, \ldots, k$. Here, the 
likelihood model is given by 

\[
Y_j \mid \gamma \sim \mathrm{Pa}(1/\gamma), \qquad j = 1, \ldots, k,
\]

for some fixed $k$. So $\theta = \gamma$ and 

\[
f(y \mid \gamma) = \frac{1}{\gamma}\, y^{-1 - 1/\gamma}, \qquad y > 1.
\]

Here, Jeffreys’ prior turns out to be particularly simple, namely, $\pi(\gamma) \propto 1/\gamma$, while 
the MDI prior is proportional to $(1/\gamma)\exp(-\gamma)$. Continuing with Jeffreys’ prior, 
the posterior is given by 

\[
\begin{aligned}
\pi(\gamma \mid \mathbf{y}) &\propto \gamma^{-k-1} \prod_{j=1}^{k} y_j^{-1 - 1/\gamma} \\
&\propto e^{-\frac{1}{\gamma} \sum_{j=1}^{k} \log y_j}\, \gamma^{-k-1},
\end{aligned}
\]

Figure 11.2 Annual maximal river discharges of the Meuse: estimated posterior 
density of (a) $\sigma$, (b) $\mu$, (c) $\gamma$ and (d) $q_{Y,0.01}$. 




Figure 11.3 Annual maximal river discharges of the Meuse: $P(Y_{m+1} \le y_{m+1} \mid \mathbf{y})$ 
and $\hat q_{Y_{m+1},0.01}$ (horizontal axis: annual maximum). 


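leading to the posterior mode estimator 

\[
\hat\gamma = \frac{1}{k+1} \sum_{j=1}^{k} \log Y_j = \frac{1}{k+1} \sum_{j=1}^{k} \left( \log X_{n-j+1,n} - \log X_{n-k,n} \right) = \frac{k}{k+1}\, H_{k,n},
\]

which is almost identical to the Hill estimator itself. Note that when normalizing 
the posterior $\pi(\gamma \mid \mathbf{y})$, one obtains 

\[
\pi(\gamma \mid \mathbf{y}) = \frac{(kH_{k,n})^{k}}{(k-1)!}\, \gamma^{-k-1}\, e^{-kH_{k,n}/\gamma}, \qquad \gamma > 0, \tag{11.7}
\]

which is an inverse gamma distribution. 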

The posterior predictive density (11.2) of a future excess $Y$ when using Jeffreys’ 
prior is given by 

\[
\begin{aligned}
f(y \mid \mathbf{y}) &= \int_0^{\infty} \frac{1}{\gamma}\, y^{-1 - 1/\gamma}\, \pi(\gamma \mid \mathbf{y})\, d\gamma \\
&= \frac{(kH_{k,n})^{k}}{y\,(k-1)!} \int_0^{\infty} w^{k}\, e^{-(\log y + kH_{k,n})\,w}\, dw \\
&= \frac{k\,(kH_{k,n})^{k}}{y}\, (\log y + kH_{k,n})^{-k-1}, \qquad y > 1,
\end{aligned}
\]

where, in the second step, we use the substitution $w = 1/\gamma$. This entails that the 
posterior predictive density of $\log Y =: V$ is given by 

\[
f(v \mid \mathbf{y}) = \frac{1}{H_{k,n}} \left( 1 + \frac{v}{kH_{k,n}} \right)^{-k-1}, \qquad v > 0,
\]

which turns out to be a GP distribution with scale equal to Hill’s statistic $H_{k,n}$ and 
shape equal to $1/k$. This then leads to an interesting objective Bayesian alternative 
to the Weissman estimator (see section 4.6.1) for small tail probabilities of Pareto-type 
distributions through (11.3): 

\[
P(X > x \mid \mathbf{y}) = \frac{k}{n} \left( 1 + \frac{\log(x/X_{n-k,n})}{kH_{k,n}} \right)^{-k}. \tag{11.8}
\]

The results of this estimator in case of the Secura Belgian Re insurance data set 
introduced in section 1.3.3 (i) with $x = 10{,}000{,}000$ are plotted in Figure 11.4 as 
a function of $k$ together with the result of the original Weissman estimator. 

Simulating $\gamma$-values from the inverse gamma distribution (11.7) and substituting 
them in the expression $(k/n)(x/X_{n-k,n})^{-1/\gamma}$ as suggested in (11.4) leads 
to an alternative approach and yields the possibility of calculating a 95% hpd 
region. The results at $k = 95$ are shown in Figure 11.5. The 95% hpd region is 
(0.00041, 0.00208). 
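
The closed-form expressions of this subsection are easy to evaluate numerically. The following Python sketch computes the Hill statistic, the tail probability (11.8) and the quantile obtained by setting (11.8) equal to p, and simulates from (11.7) by drawing $1/\gamma$ from a gamma distribution with shape k and rate $kH_{k,n}$. It runs on a simulated strict Pareto sample, since the Secura data are not reproduced here; the function names and the sample are illustrative assumptions.

```python
import math
import random


def hill(sorted_desc, k):
    """Hill statistic H_{k,n} from data sorted in decreasing order."""
    threshold = sorted_desc[k]                       # X_{n-k,n}
    return sum(math.log(x / threshold) for x in sorted_desc[:k]) / k


def bayes_tail_prob(x, threshold, k, n, h):
    """Posterior predictive tail probability (11.8) under Jeffreys' prior."""
    return (k / n) * (1.0 + math.log(x / threshold) / (k * h)) ** (-k)


def bayes_quantile(p, threshold, k, n, h):
    """Value of x at which (11.8) equals p."""
    return threshold * math.exp(k * h * ((n * p / k) ** (-1.0 / k) - 1.0))


def sample_gamma(k, h, size, rng):
    """Draws of gamma from (11.7): 1/gamma ~ Gamma(shape k, rate k*h)."""
    return [1.0 / rng.gammavariate(k, 1.0 / (k * h)) for _ in range(size)]


rng = random.Random(42)
gamma_true, n, k = 0.3, 500, 100                     # simulation settings
data = sorted(((1.0 - rng.random()) ** (-gamma_true) for _ in range(n)), reverse=True)
threshold = data[k]
h = hill(data, k)
q_hat = bayes_quantile(0.005, threshold, k, n, h)
gamma_draws = sample_gamma(k, h, 5000, rng)
# Tail probability per posterior draw, as suggested in (11.4).
tail_draws = [(k / n) * (q_hat / threshold) ** (-1.0 / g) for g in gamma_draws]
print(h, q_hat, sum(tail_draws) / len(tail_draws))
```

Averaging `tail_draws` approximates (11.8) at `q_hat`, while the spread of the draws is what an hpd region summarises.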


Figure 11.4 Secura data: $P(X > 10\ \mathrm{Mio} \mid \mathbf{y})$ (solid line) and the Weissman estimator (broken 
line) as a function of $k$. 
Figure 11.5 Secura data: simulated posterior density of (a) $\gamma$ and (b) 
$(95/371)(10{,}000{,}000/2{,}580{,}025)^{-1/\gamma}$. 

Setting (11.8) equal to $p$ and solving for $x$ leads to the estimator 

\[
\hat q_{X,p} = X_{n-k,n} \exp\left\{ kH_{k,n} \left[ \left( \frac{np}{k} \right)^{-1/k} - 1 \right] \right\} \tag{11.9}
\]

of extreme posterior predictive quantiles. 





Figure 11.6 Secura data: $\hat q_{X,p}$ (broken line) and posterior median and 95% hpd 
region of $Q(1 - p)$ (solid lines) at $k = 95$ as a function of $p$. 

Besides the posterior predictive quantile function $\hat q_{X,p}$, interest is often in the 
posterior distribution of the quantile function $Q(1 - p)$ associated with $f(y \mid \theta)$. 
Let us consider the estimation of $Q(1 - p)$ with $p \in [0.001, 0.01]$ in the insurance 
example using the threshold $t = x_{276,371}$ or equivalently $k = 95$. Substituting $\gamma$-values 
obtained by MCMC from the posterior into the expression $t\,(np/k)^{-\gamma}$ allows one 
to construct a 95% hpd region for $Q(1 - p)$. In Figure 11.6, we show $\hat q_{X,p}$ (broken 
line) and the posterior median and 95% hpd region of $Q(1 - p)$ (solid lines) at 
$k = 95$ as a function of $p$. 

From a subjective Bayesian point of view, inverse gamma priors provide more 
possibilities to incorporate an expert’s view. Inspired by Hsieh (2001), we consider 
here a three-parameter inverse gamma prior $\mathrm{IG}(\lambda, \eta, \tau)$ as a prior for $\gamma$, defined by 

\[
\pi(\gamma) = \frac{\lambda^{\eta}}{\Gamma(\eta)}\, e^{-\lambda(\gamma^{-1} - \tau)}\, (\gamma^{-1} - \tau)^{\eta - 1}\, \gamma^{-2}, \qquad 0 < \gamma < 1/\tau, \tag{11.10}
\]

with $\lambda, \eta > 0$ and $\tau \ge 0$. In case $\tau = 0$, we obtain back the classical inverse gamma 
distribution. The truncation parameter $\tau$ can be used to bound the possible values 
of $\gamma$. For instance, in insurance applications, the value $\tau = 1$ is an appropriate 
choice, since values $\gamma > 1$ (or rv’s $X$ possessing infinite mean) are not acceptable 
to (most) actuaries. Some would even argue for $\gamma < 1/2$ and hence $\tau = 2$, since in 
many insurance branches, variances are believed to be finite. The parameters $\lambda$ and 
$\eta$ together can be used to reflect the degree of uncertainty of an expert concerning 
the specification of $\gamma$. Further, the prior mode of $\gamma$ is given by 

\[
\gamma = \frac{\lambda\tau + \eta + 1 - D}{4\tau},
\]

where 

\[
D = \sqrt{(\lambda\tau + \eta + 1)^2 - 8\lambda\tau}.
\]
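Now the posterior for $\gamma$ is given by 

\[
\begin{aligned}
\pi(\gamma \mid \mathbf{y}) &\propto \gamma^{-k}\, e^{-kH_{k,n}/\gamma}\; \gamma^{-2}\, e^{-\lambda(\gamma^{-1} - \tau)}\, (\gamma^{-1} - \tau)^{\eta - 1} \\
&\propto \gamma^{-k-2}\, e^{-(\lambda + kH_{k,n})(\gamma^{-1} - \tau)}\, (\gamma^{-1} - \tau)^{\eta - 1}.
\end{aligned} \tag{11.11}
\]

The posterior mode can be found analytically: 

\[
\hat\gamma = \frac{k + 1 + \tau(\lambda + kH_{k,n}) + \eta - D}{2\tau(k + 2)}, \tag{11.12}
\]

where 

\[
D = \sqrt{\left( k + 1 + \tau(\lambda + kH_{k,n}) + \eta \right)^2 - 4\tau(k + 2)(\lambda + kH_{k,n})}.
\]

The Bayesian estimator (11.12) of a positive EVI $\gamma$ constitutes an interesting 
alternative compared to the Hill estimator. 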

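The exact normalization of the posterior (11.11) can be found using the substitution 
$\gamma^{-1} - \tau = u$: 

\[
\begin{aligned}
\int_0^{1/\tau} \gamma^{-k-2}\, e^{-(\lambda + kH_{k,n})(\gamma^{-1} - \tau)}\, (\gamma^{-1} - \tau)^{\eta - 1}\, d\gamma
&= \int_0^{\infty} (u + \tau)^{k}\, u^{\eta - 1}\, e^{-(\lambda + kH_{k,n})u}\, du \\
&= \sum_{j=0}^{k} \binom{k}{j} \tau^{k-j} \int_0^{\infty} u^{\eta + j - 1}\, e^{-(\lambda + kH_{k,n})u}\, du \\
&= \sum_{j=0}^{k} \binom{k}{j} \tau^{k-j}\, \Gamma(\eta + j)\, (\lambda + kH_{k,n})^{-\eta - j}.
\end{aligned}
\]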

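Since the $m$-th inverse moment of a random variable with density given by (11.10) 
equals 

\[
E(\gamma^{-m}(\lambda, \eta, \tau)) := E[\gamma^{-m}] = \frac{1}{\Gamma(\eta)} \sum_{j=0}^{m} \binom{m}{j} \tau^{m-j}\, \Gamma(\eta + j)\, \lambda^{-j},
\]

the posterior is given by 

\[
\pi(\gamma \mid \mathbf{y}) = \frac{(\lambda + kH_{k,n})^{\eta}}{\Gamma(\eta)\, E(\gamma^{-k}(\lambda + kH_{k,n}, \eta, \tau))}\; \gamma^{-k-2}\, e^{-(\gamma^{-1} - \tau)(\lambda + kH_{k,n})}\, (\gamma^{-1} - \tau)^{\eta - 1}.
\]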

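The posterior predictive distribution of a future excess $Y$ with a bounded generalized 
inverse gamma prior (11.10) is found to be, using again the substitution 
$\gamma^{-1} - \tau = u$, 

\[
\begin{aligned}
f(y \mid \mathbf{y}) &= \int_0^{1/\tau} \frac{1}{\gamma}\, y^{-1 - 1/\gamma}\, \pi(\gamma \mid \mathbf{y})\, d\gamma \\
&= y^{-\tau - 1}\, \frac{E\left(\gamma^{-(k+1)}(\log y + \lambda + kH_{k,n}, \eta, \tau)\right)}{E\left(\gamma^{-k}(\lambda + kH_{k,n}, \eta, \tau)\right)} \left( \frac{\log y + \lambda + kH_{k,n}}{\lambda + kH_{k,n}} \right)^{-\eta} \\
&= y^{-\tau - 1}\, \frac{\sum_{j=0}^{k+1} \binom{k+1}{j} \tau^{k+1-j}\, \Gamma(\eta + j)\, (\log y + \lambda + kH_{k,n})^{-\eta - j}}{\sum_{j=0}^{k} \binom{k}{j} \tau^{k-j}\, \Gamma(\eta + j)\, (\lambda + kH_{k,n})^{-\eta - j}},
\end{aligned}
\]

while 

\[
P(X > x \mid \mathbf{y}) = \frac{k}{n} \left( \frac{x}{X_{n-k,n}} \right)^{-\tau} \frac{\sum_{j=0}^{k} \binom{k}{j} \tau^{k-j}\, \Gamma(\eta + j) \left( \log\frac{x}{X_{n-k,n}} + \lambda + kH_{k,n} \right)^{-\eta - j}}{\sum_{j=0}^{k} \binom{k}{j} \tau^{k-j}\, \Gamma(\eta + j)\, (\lambda + kH_{k,n})^{-\eta - j}}. \tag{11.13}
\]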

We apply (11.11) and (11.13) with $x = 10$ Mio Euro to the insurance data from 
section 1.3.3 (i) with $\tau = 1$ and $\tau = 2$, and $(\lambda, \eta) = (8, 4)$, see Figure 11.7. The 
95% hpd region for $\gamma$ is (0.31489, 0.40361) in case $\tau = 1$ and (0.30110, 0.38417) in 
case $\tau = 2$. The corresponding hpd regions for $(95/371)(10{,}000{,}000/2{,}580{,}025)^{-1/\gamma}$ 
are respectively (0.00155, 0.00526) and (0.00121, 0.00422). Note that the posterior 
mode of $\gamma$ is slightly larger than the Hill estimate obtained in section 6.2 ($H_{95,371} = 
0.27109$), which can be understood from the values of the prior modes, $\gamma = 0.68826$ 
when $(\lambda, \eta, \tau) = (8, 4, 1)$, respectively $\gamma = 0.41352$ when $(\lambda, \eta, \tau) = (8, 4, 2)$. 
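
The prior mode, the posterior mode and the tail probability (11.13) can be evaluated with the short Python sketch below. It assumes $\tau > 0$, works on the log scale to keep the binomial and gamma factors within floating-point range, and obtains the posterior mode by maximising (11.11) directly; the function names are illustrative. The inputs k = 95, n = 371, $H_{95,371}$ = 0.27109 and the threshold 2,580,025 are the values quoted in the text and in the caption of Figure 11.5.

```python
import math


def prior_mode(lam, eta, tau):
    """Mode of the truncated inverse gamma prior (11.10), tau > 0."""
    a = lam * tau + eta + 1.0
    d = math.sqrt(a * a - 8.0 * lam * tau)
    return (a - d) / (4.0 * tau)


def posterior_mode(lam, eta, tau, k, h):
    """Mode of the posterior (11.11), obtained by maximising it in 1/gamma."""
    big_lam = lam + k * h
    b = tau * big_lam + eta + k + 1.0
    d = math.sqrt(b * b - 4.0 * tau * (k + 2.0) * big_lam)
    return (b - d) / (2.0 * tau * (k + 2.0))


def log_moment_sum(rate, eta, tau, m):
    """log of sum_{j=0}^{m} C(m,j) tau^(m-j) Gamma(eta+j) rate^(-eta-j)."""
    terms = [
        math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
        + (m - j) * math.log(tau) + math.lgamma(eta + j) - (eta + j) * math.log(rate)
        for j in range(m + 1)
    ]
    top = max(terms)                     # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def tail_prob(x, threshold, k, n, h, lam, eta, tau):
    """Posterior predictive tail probability (11.13)."""
    big_lam = lam + k * h
    log_num = log_moment_sum(math.log(x / threshold) + big_lam, eta, tau, k)
    log_den = log_moment_sum(big_lam, eta, tau, k)
    return (k / n) * (x / threshold) ** (-tau) * math.exp(log_num - log_den)


k95, n371, h95 = 95, 371, 0.27109
t_secura, x_level = 2_580_025.0, 10_000_000.0
for tau in (1.0, 2.0):
    print(tau, round(prior_mode(8.0, 4.0, tau), 5),
          posterior_mode(8.0, 4.0, tau, k95, h95),
          tail_prob(x_level, t_secura, k95, n371, h95, 8.0, 4.0, tau))
```

The two prior modes printed by this sketch agree with the values 0.68826 and 0.41352 quoted above, which serves as a check on the reconstruction of the prior mode formula.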


11.5.3 Inference for all domains of attractions 

Considering now the estimation of extreme events within a general extreme value 
(GEV) context as discussed in Chapter 5， and hence considering GP fits to the 
excesses Yj = Xi — t (X/ > t) for some appropriate threshold t. Choosing t again 
in an observation X n —k , n ，then the model 

/ y \-!/y 

F{y\o,y) = 1 - （1 + y-J 
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Figure 11.7 Secura data: simulated posterior density of y for (a) r = 1 and (b) 
r = 2 and simulated posterior density of 95/371(10, 000, 000/2, 580, 025) _1 / y for 
(c) r = 1 and (d) r = 2. 
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(c) 
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(d) 


0.008 


0.010 


is fitted to the excesses Yj = — X n —k ， n ， j = l,k. Jeffreys’ prior for 


the GP distribution y) 


a(l + y)^/\ + 2y 
given in Smith (1984). The MDI prior, however, is given by 


， which is finite for y > 


is 
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(d) 

Figure 11.10 Secura data: $P(Y < y \mid \mathbf{y})$ as a function of $y$ at $k = 95$.

We use the MDI prior in the insurance example with $t$ set at the 96th-largest
observation, and set up the Metropolis-Hastings algorithm with the proposal
density $q$ taken as a multivariate normal on $(\log \sigma, \gamma)$ with independent components
and respective standard deviations (0.04, 0.04), that is,

$$\log \sigma^{*} = \log \sigma^{(i)} + 0.04\,\epsilon_1, \qquad \gamma^{*} = \gamma^{(i)} + 0.04\,\epsilon_2,$$

where $(\epsilon_1, \epsilon_2)$ denote independent standard normally distributed random variables.
We obtain the estimated posterior densities of $\sigma$ and $\gamma$ given in Figure 11.8(c)
and (d) respectively. Initializing with $(\sigma, \gamma) = (570000, 1)$, the values generated
by 10,000 iterations of the chain are plotted in Figure 11.8(a) and (b). The posterior
mean estimates together with the 95% hpd regions are given by

$$\sigma = 664{,}221.3 \ (449{,}704.8;\ 917{,}847.3), \qquad \gamma = 0.32339 \ (0.07401;\ 0.68992).$$
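The random-walk Metropolis-Hastings scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the MDI prior is taken as $\pi(\sigma, \gamma) \propto \sigma^{-1} e^{-(\gamma+1)}$, the target density is expressed in the $(\log \sigma, \gamma)$ coordinates (which adds a Jacobian term $\log \sigma$, a choice made here so that the symmetric proposal is valid), and the data are synthetic GP excesses rather than the Secura claims. All function names are hypothetical.

```python
import math
import random


def gp_log_likelihood(excesses, sigma, gamma):
    """Log-likelihood of i.i.d. GP(sigma, gamma) excesses; -inf outside the support."""
    if sigma <= 0:
        return -math.inf
    total = -len(excesses) * math.log(sigma)
    for y in excesses:
        if abs(gamma) < 1e-12:          # exponential limit as gamma -> 0
            total -= y / sigma
            continue
        z = 1.0 + gamma * y / sigma
        if z <= 0:
            return -math.inf
        total -= (1.0 / gamma + 1.0) * math.log(z)
    return total


def log_target(excesses, log_sigma, gamma):
    """Log posterior density in the (log sigma, gamma) coordinates, up to a constant."""
    sigma = math.exp(log_sigma)
    loglik = gp_log_likelihood(excesses, sigma, gamma)
    if loglik == -math.inf:
        return -math.inf
    log_prior = -log_sigma - (gamma + 1.0)  # MDI prior (1 / sigma) * exp(-(gamma + 1))
    log_jacobian = log_sigma                # change of variables sigma -> log sigma
    return loglik + log_prior + log_jacobian


def metropolis_hastings(excesses, sigma0, gamma0, n_iter=10_000, steps=(0.04, 0.04), seed=0):
    """Random-walk Metropolis-Hastings; returns the (sigma, gamma) chain and acceptance rate."""
    rng = random.Random(seed)
    log_sigma, gamma = math.log(sigma0), gamma0
    current = log_target(excesses, log_sigma, gamma)
    if current == -math.inf:
        raise ValueError("initial value has zero posterior density")
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop_log_sigma = log_sigma + steps[0] * rng.gauss(0.0, 1.0)
        prop_gamma = gamma + steps[1] * rng.gauss(0.0, 1.0)
        proposed = log_target(excesses, prop_log_sigma, prop_gamma)
        # Symmetric proposal: accept with probability min(1, exp(proposed - current)).
        if math.log(1.0 - rng.random()) < proposed - current:
            log_sigma, gamma, current = prop_log_sigma, prop_gamma, proposed
            accepted += 1
        chain.append((math.exp(log_sigma), gamma))
    return chain, accepted / n_iter


# Synthetic GP excesses (NOT the Secura data), simulated by inversion of the GP cdf.
_rng = random.Random(1)
true_sigma, true_gamma = 1.0, 0.3
excesses = [true_sigma / true_gamma * ((1.0 - _rng.random()) ** (-true_gamma) - 1.0)
            for _ in range(95)]
chain, acceptance_rate = metropolis_hastings(excesses, sigma0=1.0, gamma0=1.0, n_iter=2000)
```

Posterior means and hpd regions would then be computed from the later part of `chain`, after discarding an initial burn-in period whose length is a matter of judgement.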

The posterior distribution of the quantile function of the original $X$-distribution
can be simulated on the basis of

$$q_p = t + \frac{\sigma}{\gamma}\left(\left(\frac{np}{k}\right)^{-\gamma} - 1\right),$$

replacing $\sigma$ and $\gamma$ by their respective posterior realizations. In the insurance
example with $t$ set at the 96th-largest observation and $p = 0.01$, we obtain Figure
11.9. The posterior predictive distribution of a future claim given the past claims
is given in Figure 11.10.
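Simulating the posterior distribution of a quantile amounts to evaluating the GP-based quantile expression, $q_p = t + (\sigma/\gamma)((np/k)^{-\gamma} - 1)$ with $p$ the exceedance probability, at every posterior realization of $(\sigma, \gamma)$. A minimal sketch follows; the three posterior realizations are hypothetical, and the threshold value $t = 2{,}580{,}025$ is assumed here to be the one appearing in the Pareto-type example.

```python
import math
import statistics


def gp_quantile(p, t, sigma, gamma, k, n):
    """Quantile with exceedance probability p under the GP tail fit above threshold t."""
    if not 0.0 < p <= k / n:
        raise ValueError("p must lie in (0, k/n] so that the quantile is at least t")
    ratio = n * p / k
    if abs(gamma) < 1e-12:              # exponential limit as gamma -> 0
        return t - sigma * math.log(ratio)
    return t + sigma / gamma * (ratio ** (-gamma) - 1.0)


def posterior_quantile_draws(posterior_draws, p, t, k, n):
    """Evaluate q_p at every posterior realization (sigma, gamma)."""
    return [gp_quantile(p, t, sigma, gamma, k, n) for sigma, gamma in posterior_draws]


# Hypothetical posterior realizations; in practice they come from the MCMC output.
draws = [(600_000.0, 0.25), (664_221.3, 0.32339), (750_000.0, 0.40)]
# t is an assumed threshold value; k = 95 and n = 371 follow the insurance example.
q_draws = posterior_quantile_draws(draws, p=0.01, t=2_580_025.0, k=95, n=371)
q_median = statistics.median(q_draws)
```

A histogram or kernel density estimate of `q_draws`, computed from a full chain, gives the kind of posterior density for $q_p$ that the text refers to.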

In a hydrological context, Coles and Tawn (1996a) consider prior elicitation on
the basis of annual return levels rather than in terms of the GEV or GP parameters.
It can indeed be argued that hydrological experts are probably more familiar with
quantile specification than with parameter specification. Then, in the case of GEV
modelling of maxima, one specifies prior information in terms of $(q_{p_1}, q_{p_2}, q_{p_3})$
with $p_1 > p_2 > p_3$, where

$$q_p = \mu + \frac{\sigma}{\gamma}\left(\left(-\log(1 - p)\right)^{-\gamma} - 1\right). \qquad (11.14)$$

Since the quantiles $q_{p_1}, q_{p_2}, q_{p_3}$ have to be ordered, Coles and Tawn considered
independent gamma$(\alpha_i, \beta_i)$ ($i = 1, 2, 3$) priors on

$$\tilde{q}_1 = q_{p_1}, \qquad \tilde{q}_2 = q_{p_2} - q_{p_1}, \qquad \tilde{q}_3 = q_{p_3} - q_{p_2}.$$

The hyperparameters $\alpha_i, \beta_i$ were determined by measures of location and variability
in prior belief. The experts were asked to specify the median and 90% quantiles
of each of the $\tilde{q}_i$, from which the gamma parameter estimates were obtained. In
the rainfall example considered in Coles and Tawn (1996a), the values $p_1 = 0.1$,
$p_2 = 0.01$ and $p_3 = 0.001$ were chosen. From the prior specification, the joint
prior for the $q_{p_i}$ is obtained as

$$\pi(q_{p_1}, q_{p_2}, q_{p_3}) \propto \prod_{i=1}^{3} \left(q_{p_i} - q_{p_{i-1}}\right)^{\alpha_i - 1} \exp\left(-\beta_i \left(q_{p_i} - q_{p_{i-1}}\right)\right),$$

where $q_{p_0} = 0$ and with $0 < q_{p_1} < q_{p_2} < q_{p_3}$. Substituting the quantile expression
(11.14) in this prior for $(q_{p_1}, q_{p_2}, q_{p_3})$ and multiplying by the Jacobian of the
transformation $(q_{p_1}, q_{p_2}, q_{p_3}) \to \theta = (\mu, \sigma, \gamma)$ leads directly to an expression for
the prior in terms of the GEV parameters.
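The change of variables from quantiles to GEV parameters can be illustrated numerically. The sketch below evaluates, up to a constant, the log prior on $\theta = (\mu, \sigma, \gamma)$ induced by independent gamma priors on the quantile differences. It rests on several assumptions that are not in the text: the gamma densities are parameterized by shape $\alpha_i$ and rate $\beta_i$, the Jacobian is approximated by central differences rather than derived analytically, and the hyperparameter values and all function names are hypothetical.

```python
import math


def gev_quantile(p, mu, sigma, gamma):
    """Quantile q_p of GEV(mu, sigma, gamma) with exceedance probability p, as in (11.14)."""
    x = -math.log(1.0 - p)
    if abs(gamma) < 1e-12:              # Gumbel limit as gamma -> 0
        return mu - sigma * math.log(x)
    return mu + sigma / gamma * (x ** (-gamma) - 1.0)


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def quantile_jacobian(theta, probs, rel_step=1e-6):
    """Central-difference approximation of d(q_p1, q_p2, q_p3) / d(mu, sigma, gamma)."""
    jac = [[0.0] * 3 for _ in range(3)]
    for col in range(3):
        h = rel_step * max(1.0, abs(theta[col]))
        up, down = list(theta), list(theta)
        up[col] += h
        down[col] -= h
        for row, p in enumerate(probs):
            jac[row][col] = (gev_quantile(p, *up) - gev_quantile(p, *down)) / (2.0 * h)
    return jac


def log_prior_gev(theta, probs, alphas, betas):
    """Log prior of theta = (mu, sigma, gamma) induced by gamma priors on quantile differences."""
    mu, sigma, gamma = theta
    if sigma <= 0:
        return -math.inf
    quantiles = [gev_quantile(p, mu, sigma, gamma) for p in probs]
    log_density, previous = 0.0, 0.0    # previous plays the role of q_{p_0} = 0
    for q, alpha, beta in zip(quantiles, alphas, betas):
        diff = q - previous
        if diff <= 0:                   # ordering constraint violated
            return -math.inf
        log_density += (alpha - 1.0) * math.log(diff) - beta * diff
        previous = q
    det = det3(quantile_jacobian(theta, probs))
    if det == 0.0:
        return -math.inf
    return log_density + math.log(abs(det))


probs = (0.1, 0.01, 0.001)              # exceedance probabilities used in the rainfall example
alphas, betas = (30.0, 15.0, 10.0), (0.5, 0.5, 0.25)   # hypothetical hyperparameters
theta = (30.0, 10.0, 0.1)               # hypothetical (mu, sigma, gamma)
log_prior_value = log_prior_gev(theta, probs, alphas, betas)
```

Because the prior is only needed up to a constant in a Metropolis-Hastings ratio, the normalizing constants of the gamma densities are omitted.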

11.6 An Environmental Application 

We end by illustrating the use of Bayesian modelling of extreme values with
an example from an environmental context. It concerns the Bayesian modelling
of wind-speed measurements from three different locations in Cape Town (South
Africa). Figure 11.11 contains the boxplots of the monthly maximal wind gust
measurements at Cape Town harbour, airport and Robben Island respectively.

Figure 11.11 Monthly maximal wind gust measurements at Cape Town.

Let $Y_{i,j}$ denote the maximum wind gust measurement of month $j$, $j = 1, \ldots, 70$, at
location $i$, $i = 1, 2, 3$. Following condition $(C_\gamma)$,

$$Y_{i,j} \mid \sigma_i, \gamma_i, \mu_i \sim \mathrm{GEV}(\sigma_i, \gamma_i, \mu_i), \qquad j = 1, \ldots, 70, \quad i = 1, 2, 3.$$

The parameter vectors $(\sigma_i, \gamma_i, \mu_i)$, $i = 1, 2, 3$, are assumed to be i.i.d. according to

$$(\log \sigma, \log \gamma, \mu) \sim N_3(\phi, \Sigma),$$

with $\phi' = (3, -2, 40)$ and $\Sigma = 10 I_3$. The proposed prior distribution reflects the
beliefs of the harbour master. We use the Metropolis-Hastings algorithm to simulate
the posterior distribution of the model parameters. The proposal density is
taken as a multivariate normal on $(\log \sigma_i, \log \gamma_i, \mu_i)$, $i = 1, 2, 3$, with independent
components and respective standard deviations (0.03, 0.03, 0.03). Initializing with
$(\log \sigma_i, \log \gamma_i, \mu_i) = (3, -2, 40)$, $i = 1, 2, 3$, the values generated by 10,000 iterations
of the chain are plotted in Figures 11.12-11.14. Note that the three locations
are quite similar with respect to the tail index $\gamma$. The (heavy-tailed) posterior distributions
of $\gamma_1$, $\gamma_2$ and $\gamma_3$ have a median around 0.135. The differences between
the monthly maximal wind gust distributions manifest themselves through the posterior
distributions of the parameters $\mu$ and $\sigma$. Finally, in Figure 11.15, we show
the posterior median and 95% hpd region of $q_{Y,p}$ as a function of $p$ for the three
locations. The quantiles of the maximal wind gust distribution clearly tend to be
largest at the harbour.
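The sampler for the wind-speed model can be sketched along the following lines. Since the hyperparameters of the normal prior are fixed, the posterior factorizes over the three locations, so each location may be sampled separately; the sketch uses that observation. It is an illustration under assumptions rather than the authors' code: the maxima are synthetic (the Cape Town measurements are not reproduced here), the simulation parameters and all function names are hypothetical, and only the prior mean $(3, -2, 40)$, the prior covariance $10 I_3$ and the proposal standard deviation 0.03 are taken from the text.

```python
import math
import random

PRIOR_MEAN = (3.0, -2.0, 40.0)   # prior mean of (log sigma, log gamma, mu), from the text
PRIOR_VARIANCE = 10.0            # diagonal prior covariance 10 * I_3, from the text


def gev_log_likelihood(maxima, sigma, gamma, mu):
    """Log-likelihood of i.i.d. GEV maxima with gamma > 0; -inf outside the support."""
    if sigma <= 0 or gamma <= 0:
        return -math.inf
    total = -len(maxima) * math.log(sigma)
    try:
        for y in maxima:
            z = 1.0 + gamma * (y - mu) / sigma
            if z <= 0:
                return -math.inf
            total -= (1.0 / gamma + 1.0) * math.log(z) + z ** (-1.0 / gamma)
    except OverflowError:        # numerically zero likelihood
        return -math.inf
    return total


def log_posterior(maxima, theta):
    """Log posterior of theta = (log sigma, log gamma, mu) for one location, up to a constant."""
    log_sigma, log_gamma, mu = theta
    log_prior = -sum((v - m) ** 2 for v, m in zip(theta, PRIOR_MEAN)) / (2.0 * PRIOR_VARIANCE)
    return gev_log_likelihood(maxima, math.exp(log_sigma), math.exp(log_gamma), mu) + log_prior


def metropolis_hastings(maxima, theta0, n_iter=10_000, step=0.03, seed=0):
    """Random-walk Metropolis-Hastings on (log sigma, log gamma, mu)."""
    rng = random.Random(seed)
    theta = list(theta0)
    current = log_posterior(maxima, theta)
    if current == -math.inf:
        raise ValueError("initial value has zero posterior density")
    chain = []
    for _ in range(n_iter):
        proposal = [v + step * rng.gauss(0.0, 1.0) for v in theta]
        proposed = log_posterior(maxima, proposal)
        # Symmetric proposal: accept with probability min(1, exp(proposed - current)).
        if math.log(1.0 - rng.random()) < proposed - current:
            theta, current = proposal, proposed
        chain.append(tuple(theta))
    return chain


def simulate_gev(rng, size, sigma, gamma, mu):
    """Simulate GEV maxima by inversion of the distribution function."""
    draws = []
    for _ in range(size):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        draws.append(mu + sigma / gamma * ((-math.log(u)) ** (-gamma) - 1.0))
    return draws


# Synthetic monthly maxima for three locations (NOT the Cape Town data).
_rng = random.Random(42)
synthetic = [simulate_gev(_rng, 70, sigma=15.0, gamma=0.1, mu=45.0 + 5.0 * i) for i in range(3)]
chains = [metropolis_hastings(y, PRIOR_MEAN, n_iter=2000, seed=i)
          for i, y in enumerate(synthetic)]
```

Note that the initial value $\log \gamma = -2$ corresponds to $\gamma = e^{-2} \approx 0.135$. Posterior medians of $\gamma_i = \exp(\log \gamma_i)$ and of the quantiles would be computed from the chains after discarding a burn-in period.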
Statistics of Extremes: Theory and Applications J. Beirlant, Y. Goegebeur, J. Segers, and J. Teugels 
© 2004 John Wiley & Sons, Ltd ISBN: 0-471-97647-4 


